Chip 1998 September

Text File | 1997-03-29 | 130.0 KB | 3,059 lines

Mini_HOWTO: Multi Disk System Tuning
Stein Gjoen, sgjoen@nyx.net
v0.12b, 23 March 1997

This document describes how best to use multiple disks and partitions for a Linux system. Although some of this text is Linux specific the general approach outlined here can be applied to many other multi tasking operating systems.
______________________________________________________________________

Table of Contents:

1. Introduction
   1.1. Copyright
   1.2. Disclaimer
   1.3. News
   1.4. Credits

2. Structure
   2.1. Logical structure
   2.2. Document structure

3. Drive technologies
   3.1. Drives
   3.2. Geometry
   3.3. Media
      3.3.1. Magnetic Drives
      3.3.2. Optical drives
      3.3.3. Solid State Drives
   3.4. Interfaces
      3.4.1. MFM and RLL
      3.4.2. IDE and ATA
      3.4.3. EIDE, Fast-ATA and ATA-2
      3.4.4. ATAPI
      3.4.5. SCSI
   3.5. Cabling
   3.6. Host Adapters
   3.7. Comparisons
   3.8. Future Development
   3.9. Recommendations

4. Considerations
   4.1. File system features
      4.1.1. Swap
      4.1.2. Temporary storage (
      4.1.3. Spool areas (
      4.1.4. Home directories (
      4.1.5. Main binaries (
      4.1.6. Libraries (
      4.1.7. Root
      4.1.8. DOS etc.
   4.2. Explanation of terms
      4.2.1. Speed
      4.2.2. Reliability
      4.2.3. Files
   4.3. Technologies
      4.3.1. RAID
      4.3.2. AFS, Veritas and Other Volume Management Systems
      4.3.3. Linux
      4.3.4. General File System Consideration
      4.3.5. Compression
      4.3.6. Physical Track Positioning

5. Other Operating System
   5.1. DOS
   5.2. Windows
   5.3. OS/2
   5.4. NT
   5.5. Sun OS
      5.5.1. Sun OS 4
      5.5.2. Sun OS 5 (aka Solaris)

6. Clusters

7. Mounting Points

8. Disk Layout
   8.1. Selection
   8.2. Mapping
   8.3. Optimizing
      8.3.1. Optimizing by characteristics
      8.3.2. Optimizing by drive parallelising
   8.4. Usage requirements
   8.5. Servers
      8.5.1. Home directories
      8.5.2. Anonymous FTP
      8.5.3. WWW
      8.5.4. Mail
      8.5.5. News
      8.5.6. Others
   8.6. Pitfalls
   8.7. Compromises

9. Implementation
   9.1. Drives and Partitions
   9.2. Partitioning
   9.3. Multiple devices (
   9.4. Formatting
   9.5. Mounting

10. Maintenance
   10.1. Backup
   10.2. Defragmentation
   10.3. Upgrades

11. Further Information

12. Concluding Remarks
   12.1. Coming Soon
   12.2. Request for Information
   12.3. Suggested Project Work

13. Questions and Answers

14. Bits and Pieces
   14.1. Combining
   14.2. Interleaved
   14.3. Swap partition: to use or not to use
   14.4. Mount point and
   14.5. SCSI id numbers and names
   14.6. Dejanews
   14.7. File system structure

15. Appendix A: Partitioning layout table: mounting and linking

16. Appendix B: Partitioning layout table: numbering and sizing

17. Appendix C: Partitioning layout table: partition placement

18. Appendix D: Example: Multipurpose server

19. Appendix E: Example: mounting and linking

20. Appendix F: Example: numbering and sizing

21. Appendix G: Example: partition placement

22. Appendix H: Example II

23. Appendix H: Example III: SPARC Solaris
______________________________________________________________________

1. Introduction

In commemoration of the "Linux Hacker V2.0 - The New Generation" this brand new release is code named the Pink Socks 2 release. After all, socks come in pairs... New code names will appear as per industry standard guidelines to emphasize the state-of-the-art-ness of this document.

This document was written for two reasons, mainly because I got hold of 3 old SCSI disks to set up my Linux system on and I was pondering how best to utilise the inherent possibilities of parallelizing in a SCSI system. Secondly I hear there is a prize for people who write documents...

This is intended to be read in conjunction with the Linux Filesystem Structure Standard (FSSTND). It does not in any way replace it but tries to suggest where physically to place directories detailed in the FSSTND, in terms of drives, partitions, types, RAID, file system (fs), physical sizes and other parameters that should be considered and tuned in a Linux system, ranging from single home systems to large servers on the Internet.

Even though it is now more than a year since the last release of the FSSTND, work is still continuing under a new name. The new standard will encompass more than Linux, fill in a few blanks hinted at in FSSTND version 1.2 and bring other general improvements. The development mailing list is currently private but a general release is hopefully in the near future. The new issue will be named Filesystem Hierarchy Standard (FHS) and will cover more than Linux alone.

It is also a good idea to read the Linux Installation guides thoroughly, and if you are using a PC system, which I guess the majority still do, you can find much relevant and useful information in the FAQs for the newsgroup comp.sys.ibm.pc.hardware, especially for storage media.

This is also a learning experience for myself and I hope I can start the ball rolling with this Mini-HOWTO and that it perhaps can evolve into a larger, more detailed and hopefully even more correct HOWTO.

First of all we need a bit of legalese. Recent development shows it is quite important.

1.1. Copyright

This HOWTO is copyrighted 1996 Stein Gjoen.

Unless otherwise stated, Linux HOWTO documents are copyrighted by their respective authors. Linux HOWTO documents may be reproduced and distributed in whole or in part, in any medium physical or electronic, as long as this copyright notice is retained on all copies. Commercial redistribution is allowed and encouraged; however, the author would like to be notified of any such distributions.

All translations, derivative works, or aggregate works incorporating any Linux HOWTO documents must be covered under this copyright notice. That is, you may not produce a derivative work from a HOWTO and impose additional restrictions on its distribution. Exceptions to these rules may be granted under certain conditions; please contact the Linux HOWTO coordinator at the address given below.

In short, we wish to promote dissemination of this information through as many channels as possible.
However, we do wish to retain copyright on the HOWTO documents, and would like to be notified of any plans to redistribute the HOWTOs.

If you have questions, please contact Greg Hankins, the Linux HOWTO coordinator, at gregh@sunsite.unc.edu via email.

1.2. Disclaimer

Use the information in this document at your own risk. I disavow any potential liability for the contents of this document. Use of the concepts, examples, and/or other content of this document is entirely at your own risk.

All copyrights are owned by their owners, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark.

You are strongly recommended to take a backup of your system before major installation and backups at regular intervals.

1.3. News

Since the 0.11 version was released there have been too many changes to list here. The document has grown a lot, actually beyond expectations. There are many new chapters, old sections expanded into separate chapters and many other improvements.

I have also upgraded my system to Debian 1.1.11 and have replaced the old Slackware values with the Debian values for disk space requirements for the various directories. As it happens I installed version 1.1.11 just a few days before Debian 1.2 hit the streets. There are no points for guessing what will appear in the next major release of this document. In the mean time I will use Debian as a base for discussions and examples here, though the HOWTO is equally applicable to other distributions, even other operating systems.

I have now done a preliminary installation of Debian 1.2.6 and resized some of my values accordingly; more updates are coming later.

More news: there has been a fair bit of interest in new kinds of file systems in the comp.os.linux newsgroups, in particular logging, journaling and inherited file systems. Watch out for updates. Projects on volume management are also under way.
The old defragmentation program for ext2fs is being updated and there is continuing interest in compression.

The latest version number of this document can be gleaned from my plan entry if you do "finger sgjoen@nox.nyx.net".

Also, the latest version will be available on my web space on nyx: The Multiple Disk Layout mini-HOWTO Homepage <http://www.nyx.net/~sgjoen/disk.html>. A text-only version as well as the SGML source can also be downloaded there. A nicely formatted postscript version is also available now.

Also planned is a series of URLs to helpful software referred to in this document. A mirror in Europe will be announced soon.

1.4. Credits

In this version I have the pleasure of acknowledging even more people who have contributed in one way or another:

    ronnej@ucs.orst.edu
    cm@kukuruz.ping.at
    armbru@pond.sub.org
    R.P.Blake@open.ac.uk
    neuffer@goofy.zdv.Uni-Mainz.de
    sjmudd@phoenix.ea4els.ampr.org
    nat@nataa.fr.eu.org
    sundbyk@horten.geco-prakla.slb.com
    gjoen@sn.no
    mike@i-Connect.Net
    roth@uiuc.edu

Special thanks go to nakano@apm.seikei.ac.jp for doing the Japanese translation, general contributions as well as contributing an example of a computer in an academic setting, which is included at the end of this document.

Not many still, so please read through this document, make a contribution and join the elite. If I have forgotten anyone, please let me know.

New in this version is an appendix with a few tables you can fill in for your system in order to simplify the design process.

Any comments or suggestions can be mailed to my mail address on nyx: sgjoen@nyx.net.

So let's cut to the chase where swap and /tmp are racing along hard drive...

2. Structure

As this type of document is supposed to be as much for learning as a technical reference document I have rearranged the structure to this end.
For the designer of a system it is more useful to have the information presented in terms of the goals of this exercise than from the point of view of the logical layer structure of the devices themselves. Nevertheless this document would not be complete without such a layer structure the computer field is so full of, so I will include it here as an introduction to how it works.

It is a long time since the mini in mini-HOWTO could be defended as proper but I am convinced that this document is as long as it needs to be in order to make the right design decisions, and not longer.

2.1. Logical structure

This is based on how each layer accesses the others, traditionally with the application on top and the physical layer on the bottom. It is quite useful to show the interrelationship between each of the layers used in controlling drives.

        ___________________________________________________________
        |__     File structure          ( /usr /tmp etc)        __|
        |__     File system             (ext2fs, vfat etc)      __|
        |__     Volume management       (AFS)                   __|
        |__     RAID, concatenation     (md)                    __|
        |__     Device driver           (SCSI, IDE etc)         __|
        |__     Controller              (chip, card)            __|
        |__     Connection              (cable, network)        __|
        |__     Drive                   (magnetic, optical etc) __|
        -----------------------------------------------------------

In the above diagram both volume management and RAID and concatenation are optional layers. The 3 lower layers are in hardware. All parts are discussed at length later on in this document.

2.2. Document structure

Most users start out with a given set of hardware and some plans on what they wish to achieve and how big the system should be. This is the point of view I will adopt in this document in presenting the material, starting out with hardware, continuing with design constraints before detailing the design strategy that I have found to work well. I have used this both for my own personal computer at home and for a multi purpose server at work, and found it worked quite well.
In addition my Japanese co-worker in this project has applied the same strategy on a server in an academic setting with similar success.

Finally at the end I have detailed some configuration tables for use in your own design. If you have any comments regarding this or notes from your own design work I would like to hear from you so this document can be upgraded.

3. Drive technologies

A far more complete discussion on drive technologies for IBM PCs can be found at the home page of The Enhanced IDE/Fast-ATA FAQ <http://thef-nym.sci.kun.nl/~pieterh/storage.html> which is also regularly posted on Usenet News.

Here I will just present what is needed to get an understanding of the technology and get you started on your setup.

3.1. Drives

This is the physical device where your data lives and although the operating system makes the various types seem rather similar they can in actual fact be very different. An understanding of how it works can be very useful in your design work. Floppy drives fall outside the scope of this document, though should there be a big demand I could perhaps be persuaded to add a little here.

3.2. Geometry

Physically, disk drives consist of one or more platters containing data that is read and written using sensors mounted on movable heads that are fixed with respect to each other. Data transfers therefore happen across all surfaces simultaneously, which defines a cylinder of tracks. The drive is also divided into sectors containing a number of data fields.

Drives are therefore often specified in terms of their geometry: the number of Cylinders, Heads and Sectors (CHS).

For various reasons there are now a number of translations between

    o the physical CHS of the drive itself

    o the logical CHS the drive reports to the BIOS or OS

    o the logical CHS used by the OS

Basically it is a mess and a source of much confusion. For more information you are strongly recommended to read the Large Disk mini-HOWTO.

3.3. Media

The media technology determines important parameters such as read/write rates, seek times, storage size as well as whether it is read/write or read only.

3.3.1. Magnetic Drives

This is the typical read-write mass storage medium, and as everything else in the computer world, comes in many flavours with different properties. Usually this is the fastest technology and offers read/write capability. The platter rotates with a constant angular velocity (CAV) with a variable physical sector density for more efficient magnetic media area utilisation. In other words, the number of bits per unit length is kept roughly constant by increasing the number of logical sectors for the outer tracks.

Seek times are around 10ms, transfer rates quite variable from one type to another but typically 4-40 MB/s.

Note that there are several kinds of transfers going on here, and that these are quoted in different units. First of all there is the platter-to-drive cache transfer, usually quoted in Mbits/s. Typical values here are about 50-250 Mbits/s. The second stage is from the built in drive cache to the adapter, and this is typically quoted in MB/s; typical quoted values here are 3-40 MB/s. Note, however, that this assumes the data is already in the cache, so for sustained readout from the platters the effective transfer rate will be dramatically lower.

Drives are often described by the geometry or drive parameters, that is the number of heads, sectors and cylinders, which is confused by translation schemes between physical and various logical geometries. This is a mine field which is described in painful detail in many storage related FAQs. Read and weep.

3.3.2. Optical drives

Optical read/write drives exist but are slow and not so common. They were used in the NeXT machine but the low speed was a source of many complaints. The low speed is mainly due to the thermal nature of the phase change that represents the data storage.
Even when using relatively powerful lasers to induce the phase changes the effects are still slower than the magnetic effect used in magnetic drives.

Today many people use CD-ROM drives which, as the name suggests, are read-only. Storage is about 650MB, transfer speeds are variable, depending on the drive, but can exceed 1.5MB/s. Data is stored on a single spiraling track so it is not useful to talk about geometry for this. Data density is constant so the drive uses constant linear velocity (CLV). Seek is also slower, about 100ms, partially due to the spiraling track. Recent high speed drives use a mix of CLV and CAV in order to maximize performance. This also reduces access time caused by the need to reach correct rotational speed for readout.

A new type (DVD) is on the horizon, offering up to about 18GB on a single disk.

3.3.3. Solid State Drives

This is a relatively recent addition to the available technology and has been made popular especially in portable computers as well as in embedded systems. Containing no movable parts they are very fast both in terms of access and transfer rates. The most popular type is flash RAM, but other types of RAM are also used. A few years ago many had great hopes for magnetic bubble memories but they turned out to be relatively expensive and are not that common.

In general the use of RAM disks is regarded as a bad idea as it is normally more sensible to add more RAM to the motherboard and let the operating system divide the memory pool into buffers, cache, program and data areas. Only in very special cases, such as real time systems with short time margins, can RAM disks be a sensible solution.

Flash RAM is today available in several 10's of megabytes in storage and one might be tempted to use it for fast, temporary storage in a computer.
There is however a huge snag with this: flash RAM has a finite life time in terms of the number of times you can rewrite data, so putting swap, /tmp or /var/tmp on such a device will certainly shorten its lifetime dramatically. Instead, using flash RAM for directories that are read often but rarely written to will be a big performance win.

In order to get the optimum life time out of flash RAM you will need to use special drivers that will use the RAM evenly and minimize the number of block erases.

This example illustrates the advantages of splitting up your directory structure over several devices.

Solid state drives have no real cylinder/head/sector addressing but for compatibility reasons this is faked by the driver to give a uniform interface to the operating system.

3.4. Interfaces

There is a plethora of interfaces to choose from, ranging widely in price and performance. Most motherboards today include an IDE interface or better; Intel supports it through the Triton PCI chip set which is very popular these days. Many motherboards also include a SCSI interface chip made by NCR that is connected directly to the PCI bus. Check what you have and what BIOS support you have with it.

3.4.1. MFM and RLL

Once upon a time this was the established technology, a time when 20MB was awesome, which compared to today's sizes makes you think that dinosaurs roamed the Earth with these drives. Like the dinosaurs these are outdated and are slow and unreliable compared to what we have today. Linux does support this but you are well advised to think twice about what you would put on this. One might argue that an emergency partition with a suitable vintage of DOS might be fitting.

3.4.2. IDE and ATA

Progress made the drive electronics migrate from the ISA slot card over to the drive itself and Integrated Drive Electronics was born. It was simple, cheap and reasonably fast so the BIOS designers provided the kind of snag that the computer industry is so full of.
A combination of an IDE limitation of 16 heads together with the BIOS limitation of 1024 cylinders gave us the infamous 504MB limit. Following the computer industry traditions again, the snag was patched with a kludge and we got all sorts of translation schemes and BIOS bodges. This means that you need to read the installation documentation very carefully and check up on what BIOS you have and what date it has, as the BIOS has to tell Linux what size drive you have. Fortunately with Linux you can also tell the kernel directly what size drive you have with the drive parameters; check the documentation for LILO and Loadlin thoroughly. Note also that IDE is equivalent to ATA, AT Attachment.

IDE uses CPU-intensive Programmed Input/Output (PIO) to transfer data to and from the drives and has no capability for the more efficient Direct Memory Access (DMA) technology. Highest transfer rate is 8.3MB/s.

3.4.3. EIDE, Fast-ATA and ATA-2

These 3 terms are roughly equivalent: fast-ATA is ATA-2, but EIDE additionally includes ATAPI. ATA-2 is what most use these days; it is faster and supports DMA. Highest transfer rate is increased to 16.6 MB/s.

3.4.4. ATAPI

The ATA Packet Interface was designed to support CD-ROM drives using the IDE port and like IDE it is cheap and simple.

3.4.5. SCSI

The Small Computer System Interface is a multi purpose interface that can be used to connect everything from drives and disk arrays to printers, scanners and more. The name is a bit of a misnomer as it has traditionally been used by the higher end of the market as well as in work stations since it is well suited for multi tasking environments.

The standard interface is 8 bits wide and can address 8 devices. There is a wide version with 16 bits that is twice as fast on the same clock and can address 16 devices. The host adapter always counts as a device and is usually number 7.

The old standard was 5MB/s and the newer fast-SCSI increased this to 10MB/s.
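
The 504MB limit mentioned in the IDE section above can be reproduced with a few lines of arithmetic. The sketch below is in Python and relies on two figures that this text does not state and that are assumed here: 63 sectors per track as the BIOS limit, and 512 bytes per sector. The function name is made up for the illustration.

```python
# Sketch: largest disk addressable when the BIOS and IDE limits are combined.
# Assumptions not stated in the text above: the BIOS allows at most
# 63 sectors per track, and a sector holds 512 bytes.

BYTES_PER_SECTOR = 512          # assumed sector size
MIB = 1024 * 1024


def chs_capacity_bytes(cylinders, heads, sectors_per_track):
    """Return the capacity in bytes of a drive with the given geometry."""
    return cylinders * heads * sectors_per_track * BYTES_PER_SECTOR


# 1024 cylinders (BIOS limit) and 16 heads (IDE limit) are from the text.
limit = chs_capacity_bytes(cylinders=1024, heads=16, sectors_per_track=63)
print(limit, "bytes =", limit // MIB, "MB")
```

Under these assumptions the result is 528482304 bytes, which is 504MB when a megabyte is counted as 1024 x 1024 bytes.
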
Recently ultra-SCSI, also known as Fast-20, arrived with 20 MB/s transfer rates for an 8 bit wide bus.

The higher performance comes at a cost that is usually higher than for (E)IDE. The importance of correct termination and good quality cables cannot be overemphasized. SCSI drives also often tend to be of a higher quality than IDE drives. Also, adding SCSI devices tends to be easier than adding more IDE drives.

There are a number of useful documents you should read if you use SCSI: the SCSI HOWTO as well as the SCSI FAQ posted on Usenet News.

SCSI also has the advantage that you can connect it easily to tape drives for backing up your data, as well as some printers and scanners. It is even possible to use it as a very fast network between computers while simultaneously sharing SCSI devices on the same bus. Work is under way but due to problems with ensuring cache coherency between the different computers connected, this is a non trivial task.

3.5. Cabling

I do not intend to make too many comments on hardware but I feel I should make a little note on cabling. This might seem like a remarkably low technological piece of equipment, yet sadly it is the source of many frustrating problems. At today's high speeds one should think of the cable more as an RF device with its inherent demands on impedance matching. If you do not take your precautions you will get a much reduced reliability or total failure. Some SCSI host adapters are more sensitive to this than others.

Shielded cables are of course better than unshielded but the price is much higher. With a little care you can get good performance from a cheap unshielded cable.

    o Use as short a cable as possible, but do not forget the 30cm minimum separation for ultra SCSI.

    o Avoid long stubs between the cable and the drive, connect the plug on the cable directly to the drive without an extension.

    o Use correct termination for SCSI devices and at the correct position: the end of the SCSI chain.

    o Do not mix shielded and unshielded cabling, do not wrap cables around metal, try to avoid proximity to metal parts along parts of the cabling. Any such discontinuities can cause impedance mismatching which in turn can cause reflection of signals which increases noise on the cable.

3.6. Host Adapters

This is the other end of the interface from the drive, the part that is connected to a computer bus. The speed of the computer bus and that of the drives should be roughly similar, otherwise you have a bottleneck in your system. Connecting a RAID 0 disk-farm to an ISA card is pointless. These days most computers come with a 32 bit PCI bus capable of 132MB/s transfers which should not represent a bottleneck for most people in the near future.

As the drive electronics migrated to the drives the remaining part that became the (E)IDE interface is so small it can easily fit into the PCI chip set. The SCSI host adapter is more complex and often includes a small CPU of its own and is therefore more expensive and not integrated into the PCI chip sets available today. Technological evolution might change this.

Some host adapters come with separate caching and intelligence but as this is basically second guessing the operating system the gains are heavily dependent on which operating system is used. Some of the more primitive ones, that shall remain nameless, experience great gains. Linux, on the other hand, has so much smarts of its own that the gains are much smaller.

Mike Neuffer, who did the drivers for the DPT controllers, states that the DPT controllers are intelligent enough that given enough cache memory they will give you a big push in performance and suggests that people who have experienced little gains with smart controllers just have not used a sufficiently intelligent caching controller.

3.7. Comparisons

SCSI offers more performance than EIDE but at a price. Termination is more complex but expansion not too difficult.
Having more than 4 (or in some cases 2) IDE drives can be complicated; with wide SCSI you can have up to 15. Some SCSI host adapters have several channels thereby multiplying the number of possible drives even further.

RLL and MFM are in general too old, slow and unreliable to be of much use.

3.8. Future Development

The general trend is for faster and faster devices for every update in the specifications. ATA-3 is just out but does not define faster transfers; that could happen in ATA-4 which is under way. Quantum has already released DMA/33.

SCSI-3 is under way and will hopefully be released soon. Faster devices are already being announced, most recently an 80MB/s monster specification has been proposed. This is based around the ultra-2 standard (which used a 40MHz clock) combined with a 16 bit cable.

Some manufacturers already announce SCSI-3 devices but this is currently rather premature as the standard is not yet firm. As the transfer speeds increase the saturation point of the PCI bus is getting closer. Currently the 64 bit version has a limit of 264MB/s. The PCI clock rate will in the future be increased from the current 33MHz to 66MHz, thereby increasing the limit to 528MB/s.

Another trend is for larger and larger drives. I hear it is possible to get 55GB on a single drive though this is rather expensive. Currently the optimum storage for your money is about 5GB but this too is continuously increasing. The introduction of DVD will in the near future have a big impact: with nearly 20GB on a single disk you can have a complete copy of even major FTP sites from around the world. The only thing we can be reasonably sure about the future is that even if it won't get any better, it will definitely be bigger.

3.9. Recommendations

My personal view is that EIDE is the best way to start out on your system, especially if you intend to use DOS as well on your machine.
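
The bus limits quoted in the Host Adapters and Future Development sections all follow from one piece of arithmetic: bus width in bytes times clock rate. Here is a minimal sketch in Python, assuming one transfer per clock cycle, so these are theoretical peaks and not sustained rates. The helper name is invented for the example.

```python
# Sketch: peak transfer rate of a parallel bus = width in bytes x clock rate.
# The helper name is hypothetical; the widths and clocks are those quoted
# in the text above.


def peak_rate_mb_per_s(width_bits, clock_mhz):
    """Peak rate in MB/s, assuming one transfer per clock cycle."""
    return (width_bits // 8) * clock_mhz


buses = [
    ("PCI, 32 bit at 33MHz", 32, 33),
    ("PCI, 64 bit at 33MHz", 64, 33),
    ("PCI, 64 bit at 66MHz", 64, 66),
    ("Ultra-2 wide SCSI, 16 bit at 40MHz", 16, 40),
]

for name, width, clock in buses:
    print(f"{name}: {peak_rate_mb_per_s(width, clock)} MB/s")
```

The four results are 132, 264, 528 and 80 MB/s, matching the figures given in the text.
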
If you plan to expand your system over many years or use it as a server I would strongly recommend you get SCSI drives. Currently wide SCSI is a little more expensive. You are generally more likely to get more for your money with standard width SCSI. There are also differential versions of the SCSI bus which increase the maximum length of the cable. The price increase is even more substantial and they cannot therefore be recommended for normal users.

In addition to disk drives you can also connect some types of scanners and printers and even networks to a SCSI bus.

Also keep in mind that as you expand your system you will draw ever more power, so make sure your power supply is rated for the job and that you have sufficient cooling. Many SCSI drives offer the option of sequential spin-up which is a good idea for large systems.

4. Considerations

The starting point in this will be to consider where you are and what you want to do. The typical home system starts out with existing hardware and the newly converted Linux user will want to get the most out of existing hardware. Someone setting up a new system for a specific purpose (such as an Internet provider) will instead have to consider what the goal is and buy accordingly. Being ambitious I will try to cover the entire range.

Various purposes will also have different requirements regarding file system placement on the drives; a large multiuser machine would probably be best off with the /home directory on a separate disk, just to give an example.

In general, for performance it is advantageous to split most things over as many disks as possible but there is a limited number of devices that can live on a SCSI bus and cost is naturally also a factor. Equally important, file system maintenance becomes more complicated as the number of partitions and physical drives increases.

4.1. File system features

The various parts of FSSTND have different requirements regarding speed, reliability and size; for instance losing root is a pain but can easily be recovered. Losing /var/spool/mail is a rather different issue. Here is a quick summary of some essential parts and their properties and requirements. Note that this is just a guide, there can be binaries in etc and lib directories, libraries in bin directories and so on.

4.1.1. Swap

Speed
    Maximum! Though if you rely too much on swap you should consider buying some more RAM. Note, however, that on many PC motherboards the cache will not work on RAM above 128MB.

Size
    Similar as for RAM. Quick and dirty algorithm: just as for tea: 16MB for the machine and 2MB for each user. The smallest kernel runs in 1MB but it is tight; use 4MB for general work and light applications, 8MB for X11 or GCC or 16MB to be comfortable. (The author is known to brew a rather powerful cuppa tea...)

    Some suggest that swap space should be 1-2 times the size of the RAM, pointing out that the locality of the programs determines how effective your added swap space is. Note that using the same algorithm as for 4BSD is slightly incorrect as Linux does not allocate space for pages in core.

    Also remember to take into account the type of programs you use. Some programs that have large working sets, such as finite element modeling (FEM), have huge data structures loaded in RAM rather than working explicitly on disk files. Data and computing intensive programs like this will cause excessive swapping if you have less RAM than they require.

    Other types of programs can lock their pages into RAM. This can be for security reasons, preventing copies of data reaching a swap device, or for performance reasons such as in a real time module. Either way, locking pages reduces the remaining amount of swappable memory and can cause the system to swap earlier than otherwise expected.

Reliability
    Medium.
    When it fails you know it pretty quickly and failure will cost you some lost work. You save often, don't you?

Note 1
    Linux offers the possibility of interleaved swapping across multiple devices, a feature that can gain you much. Check out "man 8 swapon" for more details. However, software raiding swap across multiple devices adds more overheads than you gain.

    Thus the fstab file might look like this:

        /dev/sda1       swap            swap    pri=1           0       0
        /dev/sdc1       swap            swap    pri=1           0       0

    Remember that the fstab file is very sensitive to the formatting used, read the man page carefully and do not just cut and paste the lines above.

Note 2
    Some people use a RAM disk for swapping or some other file systems. However, unless you have some very unusual requirements or setups you are unlikely to gain much from this as this cuts into the memory available for caching and buffering.

4.1.2. Temporary storage (/tmp and /var/tmp)

Speed
    Very high. On a separate disk/partition this will reduce fragmentation generally, though ext2fs handles fragmentation rather well.

Size
    Hard to tell, small systems are easy to run with just a few MB but these are notorious hiding places for stashing files away from prying eyes and quota enforcements and can grow without control on larger machines. Suggested: small home machine: 8MB, large home machine: 32MB, small server: 128MB, and large machines up to 500MB (The machine used by the author at work has 1100 users and a 300MB /tmp directory). Keep an eye on these directories, not only for hidden files but also for old files. Also be prepared that these partitions might be the first reason you might have to resize your partitions.

Reliability
    Low. Often programs will warn or fail gracefully when these areas fail or are filled up. Random file errors will of course be more serious, no matter what file area this is.

Files
    Mostly short files but there can be a huge number of them.
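
To make the fstab example from the swap section (4.1.1) more concrete, the sketch below parses such lines and groups the swap areas by their pri= option; areas that share a priority are the ones that get interleaved. It is written in Python, the function name is hypothetical, and it only handles the simple six-field layout shown in that example.

```python
# Sketch: group swap entries from fstab-style text by their pri= option.
# Swap areas with equal priority are used in an interleaved fashion.
from collections import defaultdict

FSTAB_TEXT = """\
/dev/sda1  swap  swap  pri=1  0  0
/dev/sdc1  swap  swap  pri=1  0  0
"""


def swap_priorities(fstab_text):
    """Map each swap priority to the list of devices that use it."""
    groups = defaultdict(list)
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4 or fields[2] != "swap":
            continue                      # not a swap entry
        device, options = fields[0], fields[3].split(",")
        for opt in options:
            if opt.startswith("pri="):
                groups[int(opt[4:])].append(device)
    return dict(groups)


print(swap_priorities(FSTAB_TEXT))
```

For the two lines above, both devices end up under priority 1, which is the situation the interleaving note describes. Giving the devices different priorities would instead put them in separate groups.
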
Normally programs delete their old tmp files but if an interruption occurs the files could survive. Many distributions have a policy of cleaning out tmp files at boot time; you might want to check what your setup is.

Note
In FSSTND there is a note about putting /tmp on RAM disk. This, however, is not recommended for the same reasons as stated for swap. Also, as noted earlier, do not use flash RAM drives for these directories. One should also keep in mind that some systems are set to automatically clean tmp areas on rebooting.

(* That was 50 lines, I am home and dry! *)

4.1.3. Spool areas (/var/spool/news and /var/spool/mail)

Speed
High, especially on large news servers. News transfer and expiring are disk intensive and will benefit from fast drives. Print spools: low. Consider RAID0 for news.

Size
For news/mail servers: whatever you can afford. For single user systems a few MB will be sufficient if you read continuously. Joining a list server and taking a holiday is, on the other hand, not a good idea. (Again the machine I use at work has 100MB reserved for the entire /var/spool)

Reliability
Mail: very high, news: medium, print spool: low. If your mail is very important (isn't it always?) consider RAID for reliability.

Files
Usually a huge number of files that are around a few KB in size. Files in the print spool can on the other hand be few but quite sizable.

Note
Some of the news documentation suggests putting all the .overview files on a drive separate from the news files; check out all news FAQs for more information.

4.1.4. Home directories (/home)

Speed
Medium. Although many programs use /tmp for temporary storage, others such as some news readers frequently update files in the home directory, which can be noticeable on large multiuser systems. For small systems this is not a critical issue.

Size
Tricky! On some systems people pay for storage so this is then usually a question of finance.
Large systems such as nyx.net <http://www.nyx.net/> (which is a free Internet service with mail, news and WWW services) run successfully with a suggested limit of 100K per user and 300K as enforced maximum. Commercial ISPs offer typically about 5MB in their standard subscription packages. If however you are writing books or are doing design work the requirements balloon quickly.

Reliability
Variable. Losing /home on a single user machine is annoying but when 2000 users call you to tell you their home directories are gone it is more than just annoying. For some their livelihood relies on what is here. You do regular backups of course?

Files
Equally tricky. The minimum setup for a single user tends to be a dozen files, 0.5 - 5 kB in size. Project related files can be huge though.

Note
You might consider RAID for either speed or reliability. If you want extremely high speed and reliability you might be looking at other operating system and hardware platforms anyway. (Fault tolerance etc.)

4.1.5. Main binaries (/usr/bin and /usr/local/bin)

Speed
Low. Often data is bigger than the programs which are demand loaded anyway so this is not speed critical. Witness the successes of live file systems on CD ROM.

Size
The sky is the limit but 200MB should give you most of what you want for a comprehensive system. A big system, for software development or a multi purpose server should perhaps reserve 500MB both for installation and for growth.

Reliability
Low. This is usually mounted under root where all the essentials are collected. Nevertheless losing all the binaries is a pain...

Files
Variable but usually of the order of 10 - 100 kB.

4.1.6. Libraries (/usr/lib and /usr/local/lib)

Speed
Medium. These are large chunks of data loaded often, ranging from object files to fonts, all susceptible to bloating. Often these are also loaded in their entirety and speed is of some use here.

Size
Variable. This is for instance where word processors store their immense font files.
The few that have given me feedback on this report about 70MB in their various lib directories. The following are some of the largest disk hogs: GCC, Emacs, TeX/LaTeX, X11 and perl.

Reliability
Low. See point ``Main binaries''.

Files
Usually large, with many of the order of 100 kB in size.

Note
For historical reasons some programs keep executables in the lib areas. One example is GCC which has some huge binaries in the /usr/lib/gcc/lib hierarchy.

4.1.7. Root

Speed
Quite low: only the bare minimum is here, much of which is only run at startup time.

Size
Relatively small. However it is a good idea to keep some essential rescue files and utilities on the root partition, and some keep several kernel versions. Feedback suggests about 20MB would be sufficient.

Reliability
High. A failure here will possibly cause a fair bit of grief and you might end up spending some time rescuing your boot partition. With some practice you can of course do this in an hour or so, but I would think that if you have some practice doing this you are also doing something wrong.

Naturally you do have a rescue disk? Of course this has been updated since you did your initial installation? There are many ready made rescue disks as well as rescue disk creation tools you might find valuable. Presumably investing some time in this saves you from becoming a root rescue expert.

Note 1
If you have plenty of drives you might consider putting a spare emergency boot partition on a separate physical drive. It will cost you a little bit of space but if your setup is huge the time saved, should something fail, will be well worth the extra space.

Note 2
For simplicity and also in case of emergencies it is not advisable to put the root partition on a RAID level 0 system. Also if you use RAID for your boot partition you have to remember to have the md option turned on for your emergency kernel.

4.1.8. DOS etc.
At the risk of sounding heretical I have included this little section about something many reading this document have strong feelings about. Unfortunately many hardware items come with setup and maintenance tools based around those systems, so here goes.

Speed
Very low. The systems in question are not famed for speed so there is little point in using prime quality drives. Multitasking and multi-threading are not available so the command queueing facility found in SCSI drives will not be taken advantage of. If you have an old IDE drive it should be good enough. The exception is to some degree Win95 and more notably NT, which have multi-threading support and should theoretically be able to take advantage of the more advanced features offered by SCSI devices.

Size
The company behind these operating systems is not famed for writing tight code so you have to be prepared to spend a few tens of MB depending on what version you install of the OS or Windows. With an old version of DOS or Windows you might fit it all in on 50MB.

Reliability
Ha-ha. As the chain is no stronger than the weakest link you can use any old drive. Since the OS is more likely to scramble itself than the drive is likely to self destruct you will soon learn the importance of keeping backups here.

Put another way: "Your mission, should you choose to accept it, is to keep this partition working. The warranty will self destruct in 10 seconds..."

Recently I was asked to justify my claims here. First of all I am not calling DOS and Windows sorry excuses for operating systems. Secondly there are various legal issues to be taken into account. Saying there is a connection between the last two sentences is merely the ravings of the paranoid. Surely. Instead I shall offer the esteemed reader a few key words: DOS 4.0, DOS 6.x and various drive compression tools that shall remain nameless.

4.2.
Explanation of terms

Naturally the faster the better, but often the happy installer of Linux has several disks of varying speed and reliability, so even though this document describes performance as 'fast' and 'slow' it is just a rough guide since no finer granularity is feasible. Even so there are a few details that should be kept in mind:

4.2.1. Speed

This is really a rather woolly mix of several terms: CPU load, transfer setup overhead, disk seek time and transfer rate. It is in the very nature of tuning that there is no fixed optimum, and in most cases price is the dictating factor. CPU load is only significant for IDE systems where the CPU does the transfer itself but is generally low for SCSI; see SCSI documentation for actual numbers. Disk seek time is also small, usually in the millisecond range. This however is not a problem if you use command queueing on SCSI, where you overlap commands and keep the bus busy all the time. News spools are a special case consisting of a huge number of normally small files, so in this case seek time can become more significant.

There are two main parameters that are of interest here:

Seek
is usually specified as the average time taken for the read/write head to seek from one track to another. This parameter is important when dealing with a large number of small files such as found in spool files. There is also the extra delay before the desired sector rotates into position under the head. This delay depends on the angular velocity of the drive, which is why that parameter is quite often quoted for a drive. Common values are 4500, 5400 and 7200 rpm (rotations per minute). Higher rpm reduces this delay but at a substantial cost. Also drives working at 7200 rpm have been known to be noisy and to generate a lot of heat, a factor that should be kept in mind if you are building a large array or "disk farm".

Transfer
is usually specified in megabytes per second.
This parameter is important when handling large files that have to be transferred. Library files, dictionaries and image files are examples of this. Drives featuring a high rotation speed also normally have fast transfers, as transfer speed is proportional to angular velocity for the same sector density.

It is therefore important to read the specifications for the drives very carefully, and note that the maximum transfer speed quite often is quoted for transfers out of the on board cache and not directly from the platter.

4.2.2. Reliability

Naturally no-one would want low reliability disks but one might be better off regarding old disks as unreliable. Also for RAID purposes (see the relevant information) it is suggested to use a mixed set of disks so that simultaneous disk crashes become less likely.

So far I have had only one report of total file system failure but here unstable hardware seemed to be the cause of the problems.

4.2.3. Files

The average file size is important in order to decide the most suitable drive parameters. A large number of small files makes the average seek time important whereas for big files the transfer speed is more important. The command queueing in SCSI devices is very handy for handling large numbers of small files, but for transfer IDE is not too far behind SCSI and is normally much cheaper.

4.3. Technologies

In order to decide how to get the most out of your devices you need to know what technologies are available and their implications. As always there can be some tradeoffs with respect to speed, reliability, power, flexibility, ease of use and complexity.

4.3.1. RAID

This is a method of increasing reliability, speed or both by using multiple disks in parallel thereby decreasing access time and increasing transfer speed. A checksum or mirroring system can be used to increase reliability.
Large servers can take advantage of such a setup but it might be overkill for a single user system unless you already have a large number of disks available. See other documents and FAQs for more information.

For Linux one can set up a RAID system using either software (the md module in the kernel) or hardware, using a Linux compatible controller. Check the documentation for what controllers can be used. A hardware solution is usually faster, and perhaps also safer, but comes at a significant cost.

Currently the only supported hardware SCSI RAID controllers are the SmartCache I/III/IV and SmartRAID I/III/IV controller families from DPT. These controllers are supported by the EATA-DMA driver in the standard kernel. This company also has an informative home page <http://www.dpt.com> which also describes various general aspects of RAID and SCSI in addition to the product related information.

More information from the author of the DPT controller drivers (EATA* drivers) can be found at his pages on SCSI <http://www.i-connect.net/~mike/scsi> and DPT <http://www.i-connect.net/~mike/scsi/dpt>.

RAID comes in many levels and flavours, of which I will give a brief overview here. Much has been written about it and the interested reader is recommended to read more about this in the RAID FAQ.

o RAID 0 is not redundant at all but offers the best throughput of all levels here. Data is striped across a number of drives so read and write operations take place in parallel across all drives. On the other hand if a single drive fails then everything is lost. Did I mention backups?

o RAID 1 is the most primitive method of obtaining redundancy, by duplicating data across all drives. Naturally this is massively wasteful but you get one substantial advantage which is fast access. The drive that accesses the data first wins. Transfers are not any faster than for a single drive, even though you might get some faster read transfers by using one track reading per drive.
Also if you have only 2 drives this is the only method of achieving redundancy.

o RAID 2, 3 and 4 are not so common and are not covered here.

o RAID 5 offers excellent redundancy without wasteful duplication. It is fast in reading but not so fast for writing. It is normally recommended to use at least 3, preferably more than 5, drives for this level.

There are also hybrids available based on RAID 1 and one other level. Many combinations are possible but I have only seen a few referred to. These are more complex than the above mentioned RAID levels.

RAID 0/1 combines striping with duplication which gives very high transfers combined with fast seeks as well as redundancy. The disadvantage is high disk consumption as well as the above mentioned complexity.

RAID 1/5 combines the speed and redundancy benefits of RAID5 with the fast seek of RAID1. Redundancy is improved compared to RAID 0/1 but disk consumption is still substantial. Implementing such a system would typically involve more than 6 drives, perhaps even several controllers or SCSI channels.

4.3.2. AFS, Veritas and Other Volume Management Systems

Although multiple partitions and disks have the advantage of making for more space and higher speed and reliability there is a significant snag: if for instance the /tmp partition is full you are in trouble even if the news spool is empty, as it is not easy to transfer quotas across partitions. Volume management is a system that does just this, and AFS and Veritas are two of the best known examples. Some also offer other file systems like log file systems and others optimised for reliability or speed. Note that Veritas is not available (yet) for Linux and it is not certain they can sell kernel modules without providing source for their proprietary code; this is just mentioned for information on what is out there. Still, you can check their home page <http://www.veritas.com> to see how such systems function.
Derek Atkins, of MIT, ported AFS to Linux and has also set up the Linux AFS mailing List for this which is open to the public. Requests to join the list should go to Request and finally bug reports should be directed to Bug Reports.

Important: as AFS uses encryption it is restricted software and cannot easily be exported from the US. AFS is now sold by Transarc and they have set up a www site. The directory structure there has been reorganized recently so I cannot give a more accurate URL than just the Transarc Home Page <http://www.transarc.com> which lands you in the root of the web site. There you can also find much general information as well as a FAQ.

Volume management is for the time being an area where Linux is lacking.

Hot news: someone has just started a virtual partition system project that will reimplement many of the volume management functions found in IBM's AIX system.

4.3.3. Linux md Kernel Patch

There is however one kernel project that attempts to do some of this, md, which has been part of the kernel distributions since 1.3.69. Currently providing spanning and RAID it is still in early development and people are reporting varying degrees of success as well as total wipe out. Use with caution.

4.3.4. General File System Consideration

In the Linux world ext2fs is well established as a general purpose system. Still for some purposes others can be a better choice. News spools lend themselves to a log file based system whereas high reliability data might need other formats. This is a hotly debated topic and there are currently few choices available but work is underway. Log file systems also have the advantage of very fast file checking. Mail servers in the 100G class can suffer file checks taking several days before becoming operational after rebooting.

The Minix file system is the oldest one, used in some rescue disk systems but otherwise very little used these days.
At one time the Xiafs was a strong contender to become the standard for Linux but it seems to have fallen behind these days.

Adam Richter from Yggdrasil posted recently that they have been working on a compressed log file based system but that this project is currently on hold. Nevertheless a non-working version is available on their FTP server. Check out the yggdrasil ftp server <ftp://ftp.yggdrasil.com/private/adam> where special patched versions of the kernel can be found. Hopefully this will be rolled into the mainstream kernel in the near future.

There is room for access control lists (ACL) and other unimplemented features in the existing ext2fs; stay tuned for future updates. There has been some talk about adding on the fly compression too.

There is also an encrypted file system available but again as this is under export control from the US, make sure you get it from a legal place.

File systems are an active field of academic and industrial research and development, the results of which are quite often freely available. Linux has in many cases been a development tool in such activities so you can expect a lot of continuous work in this field; stay tuned for the latest development.

4.3.5. Compression

Disk versus file compression is a hotly debated topic especially regarding the added danger of file corruption. Nevertheless there are several options available for the adventurous administrators. These take on many forms, from kernel modules and patches to extra libraries, but note that most suffer various forms of limitations such as being read-only. As development takes place at breakneck speed the specs have undoubtedly changed by the time you read this. As always: check the latest updates yourself. Here only a few references are given.

o DouBle features file compression with some limitations.

o Zlibc adds transparent on-the-fly decompression of files as they load.
o There are many modules available for reading compressed files or partitions that are native to various other operating systems, though currently most of these are read-only.

Also there is the user file system (userfs) that allows FTP based file system and some compression (arcfs) plus fast prototyping and many other features.

Recent kernels feature the loop or loopback device which can be used to put a complete file system within a file. There are some possibilities for using this for making new file systems with compression, tarring, encryption etc. Note that this device is unrelated to the network loopback device.

Very recently a compression package that extends ext2fs was announced. It is still under testing and will therefore mainly be of interest for kernel hackers but should soon gain stability for wider use.

4.3.6. Physical Track Positioning

This trick used to be very important when drives were slow and small, and some file systems used to take the varying characteristics into account when placing files. Higher overall speed, on board drive and controller caches and intelligence have reduced the effect of this, but there is still a little to be gained even today. As we know, "world dominance" is soon within reach but to achieve this "fast" we need to employ all the tricks we can use.

To understand the strategy we need to recall this near ancient piece of knowledge and the properties of the various track locations. This is based on the fact that transfer speeds generally increase for tracks further away from the spindle, as well as the fact that it is faster to seek to or from a central track than to or from the inner or outer tracks.

Most drives use disks running at constant angular velocity but use (fairly) constant data density across all tracks. This means that you will get much higher transfer rates on the outer tracks than on the inner tracks, a characteristic which fits the requirements for large libraries well.
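
The two effects just described can be put into rough numbers. The following is a minimal sketch in Python, not a measurement: it assumes an idealised drive with constant angular velocity and constant data density, assumes that seek targets are spread evenly over the whole disk, and uses made-up platter radii (20 mm and 45 mm) purely for illustration. The function names are hypothetical.

```python
def relative_transfer_rate(radius_mm, inner_radius_mm):
    """Transfer rate at a track relative to the innermost track.

    With constant angular velocity and constant data density, the data
    passing under the head per revolution is proportional to the track
    circumference, and therefore to the track radius.
    """
    return radius_mm / inner_radius_mm


def mean_seek_distance(position):
    """Average head travel from 'position' to a random target track.

    Positions are fractions of the full stroke: 0.0 is one edge of the
    disk and 1.0 the other. Targets are assumed to be evenly spread, so
    the result is the integral of |position - t| for t from 0 to 1.
    """
    return (position ** 2 + (1.0 - position) ** 2) / 2.0


if __name__ == "__main__":
    INNER_MM, OUTER_MM = 20.0, 45.0  # assumed radii, illustration only
    ratio = relative_transfer_rate(OUTER_MM, INNER_MM)
    print("outer/inner transfer ratio:", ratio)
    for pos in (0.0, 0.5, 1.0):
        print("start at", pos, "-> mean seek distance", mean_seek_distance(pos))
```

Under these assumptions a partition in the middle of the disk needs on average half the head travel of one at either edge, while the outermost track transfers 2.25 times as fast as the innermost one. Real drives with caches and logical geometry mapping will show smaller differences.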
Newer disks use a logical geometry mapping which differs from the actual physical mapping and which is transparently handled by the drive itself. This makes the estimation of the "middle" tracks a little harder.

Inner tracks are usually slow in transfer, and lying at one end of the seek range they are also slow to seek to. These are more suitable for the low end directories such as DOS, root and print spools.

Middle tracks are on average faster with respect to transfers than inner tracks and, being in the middle, also on average faster to seek to. This characteristic is ideal for the most demanding parts such as swap, /tmp and /var/tmp.

Outer tracks have on average even faster transfer characteristics but like the inner tracks are at the end of the seek range, so statistically they are as slow to seek to as the inner tracks. Large files such as libraries would benefit from a place here.

Hence seek time reduction can be achieved by positioning frequently accessed tracks in the middle so that the average seek distance, and therefore the seek time, is short. This can be done either by using fdisk or cfdisk to make a partition on the middle tracks or by first making a file (using dd) equal to half the size of the entire disk before creating the files that are frequently accessed, after which the dummy file can be deleted. Both cases assume starting from an empty disk.

The latter trick is suitable for news spools where the empty directory structure can be placed in the middle before putting in the data files. This also helps reduce fragmentation a little.

This little trick can be used both on ordinary drives and on RAID systems. In the latter case the calculation for centring the tracks will be different, if possible at all. Consult the latest RAID manual.

5. Other Operating System

Many Linux users have several operating systems installed, often necessitated by hardware setup systems that run under other operating systems, typically DOS or some flavour of Windows.
A small section on how best to deal with this is therefore included here.

5.1. DOS

Leaving aside the debate on whether or not DOS qualifies as an operating system, one can in general say that it has little sophistication with respect to disk operations. The more important result of this is that there can be severe difficulties in running various versions of DOS on large drives, and you are therefore strongly recommended to read the Large Drives mini-HOWTO. One effect is that you are often better off placing DOS on low track numbers.

Having been designed for small drives it has a rather unsophisticated file system (FAT) which when used on large drives will allocate enormous block sizes. It is also prone to block fragmentation which will after a while cause excessive seeks and slow effective transfers.

One solution to this is to use a defragmentation program regularly, but it is strongly recommended to back up data and verify the disk before defragmenting. All versions of DOS have chkdsk that can do some disk checking; newer versions also have scandisk which is somewhat better. There are many defragmentation programs available, and some DOS versions have one called defrag. Norton Utilities have a large suite of disk tools and there are many others available too.

As always there are snags, and this particular snake in our drive paradise is called hidden files. Some vendors started to use these for copy protection schemes, and such files would not take kindly to being moved to a different place on the drive, even if they remained in the same place in the directory structure. The result of this was that newer defragmentation programs will not touch any hidden file, which in turn reduces the effect of defragmentation.

Being a single tasking, single threading and single most other things operating system there is very little gain in using multiple drives unless you use a drive controller with built in RAID support of some kind.
There are a few utilities called join and subst which can do some multiple drive configuration but there is very little gain for a lot of work. Some of these commands have been removed in newer versions.

In the end there is very little you can do, but not all hope is lost. Many programs need fast, temporary storage, and the better behaved ones will look for environment variables called TMPDIR or TEMPDIR which you can set to point to another drive. This is often best done in autoexec.bat.

______________________________________________________________________
SET TMPDIR=E:/TMP
______________________________________________________________________

Not only will this possibly gain you some speed but it can also reduce fragmentation.

5.2. Windows

Most of the above points are valid for Windows too, with the exception of Windows95 which apparently has better disk handling and will get better performance out of SCSI drives.

A useful thing is the introduction of long filenames; to read these from Linux you will need the vfat file system for mounting these partitions.

The most important thing is the introduction of the new file system FAT32 which is better suited to large drives. The snag is that there is very little support for this today, not even in NT 4.0 or many drive utility systems. A stable driver for Linux is coming soon but is not yet ready for prime time. Stay tuned for updates.

Disk fragmentation is still a problem. Some of this can be avoided by doing a defragmentation immediately before and immediately after installing large programs or systems. I use this scheme at work and have found it to work quite well.

Windows also uses swap drives; redirecting this to another drive can give you some performance gains. There are several mini-HOWTOs telling you how best to share swap space between various operating systems.

5.3. OS/2

The only special note here is that you can get a file system driver for OS/2 that can read an ext2fs partition.

5.4.
NT

This is a more serious system featuring most buzzwords known to marketing. It is well worth noting that it features software striping and other more sophisticated setups. Check out the drive manager in the control panel. I do not have easy access to NT, so more details on this can take a bit of time.

One important snag was recently reported by acahalan at cs.uml.edu (reformatted from a Usenet News posting):

NT DiskManager has a serious bug that can corrupt your disk when you have several (more than one?) extended partitions. Microsoft provides an emergency fix program at their web site. See the knowledge base <http://www.microsoft.com/kb/> for more. (This affects Linux users, because Linux users have extra partitions)

5.5. Sun OS

There is a little bit of confusion in this area between Sun OS and Solaris. Strictly speaking Solaris is just Sun OS 5.x packaged with Openwindows and a few other things. If you run Solaris, just type uname -a to see your version. Part of the reason for this confusion is that Sun Microsystems used to use an OS from the BSD family, albeit with a few bits and pieces from elsewhere as well as things made by themselves. This was the situation up to Sun OS 4.x.y when they did a "strategic roadmap decision" and decided to switch over to the official Unix, System V, Release 4, and Sun OS 5 was born. This made a lot of people unhappy. Also this was bundled with other things and marketed under the name Solaris, which currently stands at release 2.5.1 beta.

5.5.1. Sun OS 4

This is quite familiar to most Linux users. Note however that the file system structure is quite different and does not conform to FSSTND so any planning must be based on the traditional structure. You can get some information from the man page on this: man hier. This is, like most manpages, rather brief but should give you a good start. If you are still confused by the structure it will at least be at a higher level.

5.5.2.
Sun OS 5 (aka Solaris)

This comes with a snazzy installation system that runs under Openwindows; it will help you in partitioning and formatting the drives before installing the system from CD-ROM. It will also fail if your drive setup is too far out, and as it takes a complete installation run from a full CD-ROM in a 1x only drive this failure will dawn on you only after far too long a time. That is the experience we had where I work. Instead we installed everything onto one drive and then moved things across afterwards.

The default settings are sensible for most things, yet there remains a little oddity: swap drives. Even though the official manual recommends multiple swap drives (which are used in a similar fashion as on Linux) the default is to use only a single drive. It is recommended to change this as soon as possible.

Sun OS 5 also offers a file system especially designed for temporary files, tmpfs. This is a kind of souped up RAM disk, and as with ordinary RAM disks the contents are lost when the power goes. If space is scarce parts of the pseudo drive are swapped out, so in effect you store temporary files on the swap partition. Linux does not have such a file system; it has been discussed in the past but opinions were mixed. I would be interested in hearing comments on this.

6. Clusters

In this section I will briefly touch on the ways machines can be connected together but this is so big a topic it could be a separate HOWTO in its own right, hint, hint. Also, strictly speaking, this section lies outside the scope of this HOWTO, so if you feel like getting fame etc. you could contact me and take over this part and turn it into a new document.

These days computers get outdated at an incredible rate. There is however no reason why old hardware could not be put to good use with Linux. Using an old and otherwise outdated computer as a network server can be both useful in its own right and a valuable educational exercise.
Such a local networked cluster of computers can take on many forms but to remain within the charter of this HOWTO I will limit myself to the disk strategies. Nevertheless I would hope someone else could take on this topic and turn it into a document on its own.

This is an exciting area of activity today, and many forms of clustering are available, ranging from automatic workload balancing over a local network to more exotic hardware such as Scalable Coherent Interface (SCI) which gives a tight integration of machines, effectively turning them into a single machine. Various kinds of clustering have been available for larger machines for some time and the VAXcluster is perhaps a well known example of this. Clustering is usually done in order to share resources such as disk drives, printers and terminals etc, but also to share processing resources equally transparently between the computational nodes.

There is no universal definition of clustering; here it is taken to mean a network of machines that combine their resources to serve users. Admittedly this is a rather loose definition but this will change later.

These days Linux also offers some clustering features but for a start I will just describe a simple local network. It is a good way of putting old and otherwise unusable hardware to good use, as long as the machines can run Linux or something similar.

One of the best ways of using an old machine is as a network server, in which case the effective speed is more likely to be limited by network bandwidth than by pure computational performance. For home use you can move work like the following onto such a server:

o news

o mail

o web proxy

o printer server

o modem server (PPP, SLIP, FAX, Voice mail)

You can also NFS mount drives from the server onto your workstation thereby reducing drive space requirements. Still, read the FSSTND to see what directories should not be exported. The best candidates for exporting to all machines are /usr and /var/spool.
Most of the time even slow disks will deliver sufficient performance. On the other hand, if you do processing directly on the disks on the server or have very fast networking, you might want to rethink your strategy and use faster drives. Search features on a web server and news database searches are two examples of this.

Such a network can be an excellent way of learning system administration and building up your own toaster network, as it often is called. You can get more information on this in other HOWTOs but there are two important things you should keep in mind:

o Do not pull IP numbers out of thin air. Configure your inside net using IP numbers reserved for private use, and use your network server as a router that handles the IP masquerading.

o Remember that if you additionally configure the router as a firewall you might not be able to get to your own data from the outside, depending on the firewall configuration.

The nyx network provides an example of a cluster in the sense defined here. It consists of the following machines:

nyx
     is one of the two user login machines and also provides some of the networking services.

nox (aka nyx10)
     is the main user login machine and is also the mail server.

noc
     is a dedicated news server. The news spool is made accessible through NFS mounting to nyx and nox.

arachne (aka www)
     is the web server. Web pages are written by NFS mounting onto nox.

There are also some more advanced clustering projects under way, notably

o The Beowulf Project <http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html>

o The Genoa Active Message Machine (GAMMA) <http://www.disi.unige.it/project/gamma/>

High-tech clustering requires high-tech interconnects, and SCI is one of them. To find out more you can either look up the home page of Dolphin Interconnect Solutions <http://www.dolphinics.no/>, which is one of the main actors in this field, or you can have a look at scizzl <http://www.scizzl.com/>.

7.
Mounting Points

In designing the disk layout it is important not to split off the directory tree structure at the wrong points, hence this section. As it is highly dependent on the FSSTND it has been put aside in a separate section, and will most likely have to be totally rewritten when FHS is released. Nobody knows when that will happen, and at the time of writing a debate of near-religious qualities is taking place on the mailing list. In the meanwhile this will do.

Remember that this is a list of where a separation can take place, not where it has to be. As always, good judgement is required.

Again only a rough indication can be given here. The values indicate

     0=don't separate here
     1=not recommended
     4=useful
     5=recommended

In order to keep the list short, the uninteresting parts are removed.

     Directory        Suitability
     /
     |
     +-bin            0
     +-boot           0
     +-dev            0
     +-etc            0
     +-home           5
     +-lib            0
     +-mnt            0
     +-proc           0
     +-root           0
     +-sbin           0
     +-tmp            5
     +-usr            5
     | \
     | +-X11R6        3
     | +-bin          3
     | +-lib          4
     | +-local        4
     | | \
     | | +-bin        2
     | | +-lib        4
     | +-src          3
     |
     +-var            5
       \
       +-adm          0
       +-lib          2
       +-lock         1
       +-log          1
       +-preserve     1
       +-run          1
       +-spool        4
       | \
       | +-mail       3
       | +-mqueue     3
       | +-news       5
       | +-smail      3
       | +-uucp       3
       +-tmp          5

There are of course plenty of adjustments possible; for instance a home user would not bother with splitting off the /var/spool hierarchy but a serious ISP should. The key here is usage.

8. Disk Layout

With all this in mind we are now ready to embark on the layout. I have based this on my own method, developed when I got hold of 3 old SCSI disks and boggled over the possibilities.

At the end of this document there is an appendix with a few blank forms that you can fill in to help you decide and design your system. The following few paragraphs will refer to them.

8.1. Selection

Determine your needs and set up a list of all the parts of the file system you want to be on separate partitions, and sort them in descending order of speed requirement and of how much space you want to give each partition.
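The selection step can also be tried out in a few lines of Python before committing anything to paper. This is only a sketch: the names Candidate and selection_order, the speed ratings and the sizes are all made up for illustration, and breaking ties by size is one possible reading of the sorting rule above, not a prescribed method.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    mount_point: str
    speed: int    # hypothetical rating, higher means a faster drive is wanted
    size_mb: int  # hypothetical size estimate in megabytes


def selection_order(candidates):
    """Sort by descending speed requirement, then by descending size."""
    return sorted(candidates, key=lambda c: (-c.speed, -c.size_mb))


# Made-up figures for a small machine; replace them with your own estimates.
wanted = [
    Candidate("/home", 3, 800),
    Candidate("/usr", 4, 600),
    Candidate("swap", 5, 64),
    Candidate("/tmp", 5, 100),
    Candidate("/var/spool/news", 4, 1500),
]

for rank, c in enumerate(selection_order(wanted), start=1):
    print(f"{rank}. {c.mount_point:<16} speed={c.speed} size={c.size_mb}MB")
```

The resulting order is what you would then carry over to the mapping step.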
The table in appendix A is a useful tool to select what directories you should put on different partitions. It is sorted in a logical order with space for your own additions and notes about mounting points and additional systems. It is therefore NOT sorted in order of speed; instead the speed requirements are indicated by bullets ('o').

If you plan to use RAID, make a note of the disks you want to use and what partitions you want to RAID. Remember that the various RAID solutions offer different speeds and degrees of reliability.

(Just to make it simple I'll assume we have a set of identical SCSI disks and no RAID.)

8.2. Mapping

Then we want to place the partitions onto physical disks. The point of the following algorithm is to maximise parallel operation and bus capacity. In this example the drives are A, B and C and the partitions are 987654321 where 9 is the partition with the highest speed requirement. Starting at one drive we 'meander' the partition line back and forth over the drives in this way:

     A : 9 4 3
     B : 8 5 2
     C : 7 6 1

This makes the 'sum of speed requirements' as equal as possible across the drives.

The tables in the appendices are designed to simplify the mapping process. Note the speed characteristics of your drives and note each directory under the appropriate column. Be prepared to shuffle directories, partitions and drives around a few times before you are satisfied.

After that it is recommended to sort this list according to partition numbers into the table in appendix C and to use this when running the partitioning program (fdisk or cfdisk) and when doing the installation.

8.3. Optimizing

After this there are usually a few partitions that have to be 'shuffled' over the drives, either to make them fit or because there are special considerations regarding speed, reliability, special file systems etc. Nevertheless this gives what this author believes is a good starting point for the complete setup of the drives and the partitions.
In the end it is actual use that will determine the real needs, after we have made so many assumptions. After commencing operations one should assume that a time will come when a repartitioning will be beneficial.

For instance, if one of the 3 drives in the above mentioned example is very slow compared to the two others, a better plan would be as follows:

     A : 9 6 5
     B : 8 7 4
     C : 3 2 1

8.3.1. Optimizing by characteristics

Often drives can be similar in apparent overall speed but some advantage can be gained by matching drives to the file size distribution and frequency of access. Thus binaries are suited to drives with fast access that offer command queueing, and libraries are better suited to drives with larger transfer speeds, where IDE offers good performance for the money.

8.3.2. Optimizing by drive parallelising

Avoid drive contention by looking at tasks: for instance if you are accessing /usr/local/bin chances are you will soon also need files from /usr/local/lib, so placing these on separate drives allows less seeking and possible parallel operation and drive caching. It is quite possible that choosing what may appear to be less than ideal drive characteristics will still be advantageous if you can gain parallel operations. Identify common tasks and what partitions they use, and try to keep these on separate physical drives.

Just to illustrate my point I will give a few examples of task analysis here.

Office software
     such as editing, word processing and spreadsheets is a typical example of low intensity software, both in terms of CPU and disk intensity. However, should you have a single server for a huge number of users you should not forget that most such software has auto save facilities which cause extra traffic, usually on the home directories. Splitting users over several drives would reduce contention.

News readers
     also feature auto save features on home directories, so ISPs should consider placing home directories, news spool and .overview files on separate drives.
Database applications
     can be demanding both in terms of drive usage and speed requirements. The details are naturally application specific, so read the documentation carefully with disk requirements in mind. Also consider RAID both for performance and reliability.

E-mail
     reading and sending involves home directories as well as in- and outgoing spool files. If possible keep home directories and spool files on separate drives. If you are a mail server or a mail hub, consider putting in- and outgoing spool directories on separate drives.

Software development
     can require a large number of directories for binaries, libraries, include files as well as source and project files. If possible split as much as possible across separate drives. On small systems you can place /usr/src and project files on the same drive as the home directories.

Web browsing
     is becoming more and more popular. Many browsers have a local cache which can expand to rather large volumes. As this is used when reloading pages or returning to the previous page, speed is quite important here. If however you are connected via a well configured proxy server you do not need more than typically a few megabytes per user for a session.

8.4. Usage requirements

When you get a box of 10 or so CD-ROMs with a Linux distribution and the entire contents of the big FTP sites it can be tempting to install as much as your drives can take. Soon, however, one would find that this leaves little room to grow and that it is easy to bite off more than can be chewed, at least in polite company. Therefore I will make a few comments on a few points to keep in mind when you plan out your system. Comments here are actively sought.

Testing
     Linux is simple and you don't even need a hard disk to try it out; if you can get the boot floppies to work you are likely to get it to work on your hardware.
     If the standard kernel does not work for you, do not forget that there are often special boot disk versions available for unusual hardware combinations that can solve your initial problems until you can compile your own kernel.

Learning
     about operating systems is something Linux excels at; there is plenty of documentation and the source is available. A single drive with 50MB is enough to get you started with a shell and a few of the most frequently used commands and utilities.

Hobby
     use or more serious learning requires more commands and utilities but a single drive is still all it takes; 500MB should give you plenty of room, also for sources and documentation.

Serious
     software development or just serious hobby work requires even more space. At this stage you probably have a mail and news feed that requires spool files and plenty of space. Separate drives for various tasks will begin to show a benefit. At this stage you have probably already gotten hold of a few drives too. Drive requirements get harder to estimate but I would expect 2-4GB to be plenty, even for a small server.

Servers
     come in many flavours, ranging from mail servers to full sized ISP servers. A base of 2GB for the main system should be sufficient, then add space and perhaps also drives for the separate features you will offer. Cost is the main limiting factor here but be prepared to spend a bit if you wish to justify the "S" in ISP. Admittedly, not all do it.

8.5. Servers

Big tasks require big drives and a separate section here. If possible keep as much as possible on separate drives. Some of the appendices detail the setup of a small departmental server for 10-100 users. Here I will present a few considerations for the higher end servers. In general you should not be afraid of using RAID, not only because it is fast and safe but also because it can make growth a little less painful. All the notes below come as additions to the points mentioned earlier.
Popular servers rarely just happen; rather they grow over time, and this demands both generous amounts of disk space and a good net connection. In many of these cases it might be a good idea to reserve entire SCSI drives, in singles or as arrays, for each task. This way you can move the data should the computer fail. Note that transferring drives across computers is not simple and might not always work, especially in the case of IDE drives. Drive arrays require careful setup in order to reconstruct the data correctly, so you might want to keep a paper copy of your fstab file as well as a note of SCSI IDs.

8.5.1. Home directories

Estimate how many drives you will need; if this is more than 2 I would recommend RAID, strongly. If not, you should distribute users across the drives dedicated to users based on some kind of simple hashing algorithm. For instance you could use the first 2 letters in the user name, so jbloggs is put on /u/j/b/jbloggs where /u/j is a symbolic link to a physical drive, so you can get a balanced load on your drives.

8.5.2. Anonymous FTP

This is an essential service if you are serious about service. Good servers are well maintained, documented, kept up to date, and immensely popular no matter where in the world they are located. The big server ftp.funet.fi is an excellent example of this.

In general this is not a question of CPU but of network bandwidth. Size is hard to estimate; mainly it is a question of ambition and service attitudes. I believe the big archive at ftp.cdrom.com is a *BSD machine with 50GB of disk. Memory is also important for a dedicated FTP server: about 256MB RAM would be sufficient for a very big server, whereas smaller servers can get the job done well with 64MB RAM. Network connections would still be the most important factor.

8.5.3. WWW

For many this is the main reason to get onto the Internet, in fact many now seem to equate the two.
In addition to being network intensive there is also a fair bit of drive activity related to this, mainly regarding the caches. Keeping the cache on a separate, fast drive would be beneficial. Even better would be installing a caching proxy server. This way you can reduce the cache size for each user and speed up the service while at the same time cutting down on the bandwidth requirements.

With a caching proxy server you need a fast set of drives; RAID0 would be ideal as reliability is not important here. Higher capacity is better but about 2GB should be sufficient for most. Remember to match the cache period to the capacity and demand. Too long periods would on the other hand be a disadvantage; if possible try to adjust based on the URL. For more information check up on the most used servers such as Harvest, Squid and the one from Netscape.

8.5.4. Mail

Handling mail is something most machines do to some extent. The big mail servers, however, are in a class of their own. This is a demanding task and a big server can be slow even when connected to fast drives and a good net feed. In the Linux world the big server at vger.rutgers.edu is a well known example. Unlike a news service, which is distributed and which can partially reconstruct the spool using other machines as a feed, the mail servers are centralised. This makes safety much more important, so for a major server you should consider a RAID solution with emphasis on reliability. Size is hard to estimate; it all depends on how many lists you run as well as how many subscribers you have.

8.5.5. News

This is definitely a high volume task, and very dependent on what news groups you subscribe to. On nyx there is a fairly complete feed and the spool files consume about 17GB. The biggest groups are no doubt in the alt.binaries.* hierarchy, so if you for some reason decide not to get these you can get a good service with perhaps 12GB. Still others, that shall remain nameless, feel 2GB is sufficient to claim ISP status.
In this case news expires so fast I feel the spelling IsP is barely justified.

8.5.6. Others

There are many services available on the net, and many have been put somewhat in the shadows by the web. Nevertheless, services like archie, gopher and wais, just to name a few, still exist and remain valuable tools on the net. If you are serious about starting a major server you should also consider these services. Determining the required volumes is hard; it all depends on popularity and demand. Providing good service inevitably has its costs, and disk space is just one of them.

8.6. Pitfalls

The dangers of splitting up everything into separate partitions are briefly mentioned in the section about volume management. Still, several people have asked me to emphasize this point more strongly: when one partition fills up it cannot grow any further, no matter if there is plenty of space in other partitions.

In particular look out for explosive growth in the news spool (/var/spool/news). For multi user machines with quotas keep an eye on /tmp and /var/tmp as some people try to hide their files there, just look out for filenames ending in gif or jpeg...

In fact, for single physical drives this scheme offers very little gain at all, other than making file growth monitoring easier (using 'df') and physical track positioning. Most importantly there is no scope for parallel disk access. A freely available volume management system would solve this but this is still some time in the future. However, when more specialised file systems become available even a single disk could benefit from being divided into several partitions.

8.7. Compromises

One way to avoid the aforementioned pitfalls is to set off fixed partitions only for directories with a fairly well known size such as swap, /tmp and /var/tmp, and to group together the remainder into the remaining partitions using symbolic links.

Example: a slow disk (slowdisk), a fast disk (fastdisk) and an assortment of files.
Having set up swap and tmp on fastdisk, and /home and root on slowdisk, we have (the fictitious) directories /a/slow, /a/fast, /b/slow and /b/fast left to allocate on the partitions /mnt.slowdisk and /mnt.fastdisk which represent the remaining partitions of the two drives.

Putting /a or /b directly on either drive gives the same properties to the subdirectories. We could make all 4 directories separate partitions but would lose some flexibility in managing the size of each directory. A better solution is to make these 4 directories symbolic links to appropriate directories on the respective drives.

Thus we make

     /a/fast point to /mnt.fastdisk/a/fast or /mnt.fastdisk/a.fast
     /a/slow point to /mnt.slowdisk/a/slow or /mnt.slowdisk/a.slow
     /b/fast point to /mnt.fastdisk/b/fast or /mnt.fastdisk/b.fast
     /b/slow point to /mnt.slowdisk/b/slow or /mnt.slowdisk/b.slow

and we get all fast directories on the fast drive without having to set up a partition for all 4 directories. The second (right hand) alternative gives us a flatter file system which in this case can make it simpler to keep an overview of the structure.

The disadvantage is that it is a complicated scheme to set up and plan in the first place and that all mount points and partitions have to be defined before the system installation.

9. Implementation

Having done the layout you should now have a detailed description of what goes where. Most likely this will be on paper but hopefully someone will make a more automated system that can deal with everything from the design, through partitioning to formatting and installation. This is the route one will have to take to realise the design.

Modern distributions come with installation tools that will guide you through partitioning and formatting and also set up /etc/fstab for you automatically. For later modifications, however, you will need to understand the underlying mechanisms.

9.1.
Drives and Partitions

When you start DOS or the like you will find all partitions labeled C: and onwards, with no differentiation on IDE, SCSI, network or whatever type of media you have. In the world of Linux this is rather different. During booting you will see partitions described like this:

______________________________________________________________________
Dec  6 23:45:18 demos kernel: Partition check:
Dec  6 23:45:18 demos kernel:  sda: sda1
Dec  6 23:45:18 demos kernel:  hda: hda1 hda2
______________________________________________________________________

SCSI drives are labelled sda, sdb, sdc etc, and (E)IDE drives are labelled hda, hdb, hdc etc. There are also standard names for all devices; full information can be found in /dev/MAKEDEV and ./kernel/Documentation/devices.tex.

Partitions are labelled numerically for each drive: hda1, hda2 and so on.

These are then mounted according to the file /etc/fstab before they appear as a part of the file system.

9.2. Partitioning

First you have to partition each drive into a number of separate partitions. Under Linux there are two main programs for this, fdisk and the more screen oriented cfdisk. These are complex programs, so read the manual very carefully. Under DOS there are other choices, mainly the version of fdisk that is bundled with for instance DOS, or fips. The latter has the unique advantage that it can repartition a drive without necessarily damaging existing data, unlike all the other partitioning programs. In order to get the most out of fips you should first defragment your drive. This way you can allocate more space to other partitions.

Nevertheless, it is important that you do a full backup of all your valued data before partitioning.

Partitions come in 3 flavours: primary, extended and logical. You have to use primary partitions for booting, but there is a maximum of 4 primary partitions. If you want more you have to define an extended partition within which you define your logical partitions.
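To tie the device names from the previous section to the three partition flavours, here is a small Python sketch that splits a name such as hda2 into its parts. The function name describe and the layout of the returned dictionary are invented for this illustration. It also relies on one fact not spelled out in this text: on PC style partition tables Linux reserves the numbers 1-4 for the primary and extended slots and numbers logical partitions from 5 upwards.

```python
import re

# sd = SCSI and hd = (E)IDE, as in the boot messages shown earlier.
BUS_TYPES = {"sd": "SCSI", "hd": "(E)IDE"}
NAME_PATTERN = re.compile(r"(sd|hd)([a-z])([1-9][0-9]*)?")


def describe(device):
    """Split a name like 'hda2' into bus type, drive and partition number."""
    match = NAME_PATTERN.fullmatch(device)
    if match is None:
        raise ValueError("not an sdXN or hdXN style name: " + device)
    prefix, letter, number = match.groups()
    info = {
        "bus": BUS_TYPES[prefix],
        "drive": prefix + letter,
        "partition": None,
        "kind": "whole drive",
    }
    if number is not None:
        info["partition"] = int(number)
        # Assumption: PC partition table, slots 1-4 primary or extended,
        # logical partitions numbered from 5 upwards.
        info["kind"] = "primary or extended" if int(number) <= 4 else "logical"
    return info


for name in ("sda1", "hda2", "hdb5"):
    print(name, describe(name))
```

Other device families described in the kernel device list are deliberately left out to keep the sketch short.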
Each partition has an identifier number which tells the operating system what it is; for Linux the types swap and ext2fs are the ones you will need to know.

There is a readme file that comes with fdisk that gives more in-depth information on partitioning.

9.3. Multiple devices (md)

As this kernel feature is in a state of flux, make sure to read the latest documentation on it. It is not yet stable, beware.

Briefly explained, it works by adding partitions together into new devices md0, md1 etc. using mdadd before you activate them using mdrun. This process can be automated using the file /etc/mdtab.

You then treat these like any other partition on a drive. Proceed with formatting etc. as described below using these new devices.

9.4. Formatting

Next comes partition formatting, putting down the data structures that will describe the files and where they are located. If this is the first time, it is recommended that you use formatting with verify. Strictly speaking it should not be necessary but this exercises the IO hard enough that it can uncover potential problems, such as incorrect termination, before you store your precious data. Look up the command mkfs for more details.

Linux can support a great number of file systems; rather than repeating the details here, you can read the manpage for fs which describes them in some detail. Note that your kernel has to have the drivers compiled in or made as modules in order to be able to use these features. When the time comes for kernel compiling you should read carefully through the file system feature list. If you use make menuconfig you can get online help for each file system type.

Note that some rescue disk systems require minix, msdos and ext2fs to be compiled into the kernel.

Swap partitions also have to be prepared, and for this you use mkswap.

9.5. Mounting

Data on a partition is not available to the file system until it is mounted on a mount point.
This can be done manually using mount or automatically during booting by adding appropriate lines to /etc/fstab. Read the manual for mount and pay close attention to the tabulation.

10. Maintenance

It is the duty of the system manager to keep an eye on the drives and partitions. Should any of the partitions overflow, the system is likely to stop working properly, no matter how much space is available on other partitions, until space is reclaimed.

Partitions and disks are easily monitored using df, and this should be done frequently, perhaps using a cron job or some other general system management tool.

Do not forget the swap partitions; these are best monitored using one of the memory statistics programs such as free or top.

Drive usage monitoring is more difficult but it is important for the sake of performance to avoid contention - placing too much demand on a single drive if others are available and idle.

It is important when installing software packages to have a clear idea of where the various files go. As previously mentioned GCC keeps binaries in a library directory and there are also other programs that for historical reasons are hard to figure out; X11 for instance has an unusually complex structure.

10.1. Backup

The observant reader might have noticed a few hints about the usefulness of making backups. Horror stories are legion about accidents and what happened to the person responsible when the backup turned out to be non-functional or even non-existent. You might find it simpler to invest in proper backups than a second, secret identity.

There are many options and also a mini-HOWTO ( Backup-With-MSDOS ) detailing what you need to know. In addition to the DOS specifics it also contains general information and further leads.

In addition to making these backups you should also make sure you can restore the data.
Not all systems verify that the data written is correct, and many administrators have started restoring the system after an accident, happy in the belief that everything is working, only to discover to their horror that the backups were useless. Be careful.

10.2. Defragmentation

This is very dependent on the file system design; some suffer fast and nearly debilitating fragmentation. Fortunately for us, ext2fs does not belong to this group and therefore there has been very little talk about making a defragmentation tool.

If for some reason you feel this is necessary, the quick and easy solution is to do a backup and a restore. If only a small area is affected, for instance the home directories, you could tar it over to a temporary area on another partition, delete the original and then untar it back again.

10.3. Upgrades

No matter how large your drives, the time will come when you find you need more. As technology progresses you can get ever more for your money. At the time of writing, it appears that 5GB drives give you the most bang for your bucks.

Note that with IDE drives you might have to remove an old drive, as the maximum number supported on your mother board is normally only 2 or sometimes 4. With SCSI you can have up to 7 for narrow (8-bit) SCSI or up to 15 for wide (16-bit) SCSI, per channel. Some host adapters can support more than a single channel. My personal recommendation is that you will most likely be better off with SCSI in the long run.

The question is, where should you put this new drive? In many cases the reason for expansion is that you want a larger spool area, and in that case the fast, simple solution is to mount the drive somewhere under /var/spool. On the other hand newer drives are likely to be faster than older ones, so in the long run you might find it worth your time to do a full reorganization, possibly using your old design sheets.

11.
Further Information

There is a wealth of information one should go through when setting up a major system, for instance for a news or general Internet service provider. The FAQs in the following groups are useful:

News groups

o Storage <news:comp.arch.storage>.
o PC storage <news:comp.sys.ibm.pc.hardware.storage>.
o AFS <news:alt.filesystems.afs>.
o SCSI <news:comp.periphs.scsi>.
o Linux setup <news:comp.os.linux.setup>.

Mailing lists
     raid, linux-scsi, ext2fs ...

HOWTO
     Bootdisk, Installation, , SCSI, UMSDOS ...

mini-HOWTO
     Backup-With-MSDOS, Diskless, LILO, Linux+DOS+Win95+OS2, Linux+OS2+DOS, Linux+Win95, NFS-Root, Win95+Win+Linux, ZIP Drive ...

The old Linux Large IDE mini-HOWTO is no longer valid; instead read /usr/src/linux/drivers/block/README.ide or /usr/src/linux/Documentation/ide.txt.

The kernel source is, of course, the ultimate documentation. In other words, use the source, Luke.

Much of the work here is based on the Filesystem Structure Standard (FSSTND). It has changed name to File Hierarchy Standard (FHS) and is less Linux specific. The maintainer has set up a home page <http://www.pathname.com/fhs> which tells you how to join the currently private mailing list, where the development takes place.

Many mailing lists are at vger.rutgers.edu but this is notoriously overloaded, so try to find a mirror. There are some lists mirrored at The Redhat Home Page <http://www.redhat.com>. If you want to find out more about the lists available you can send a message containing the line lists to the list server. The lists linux-raid and linux-scsi are of particular interest.

A few project pages:

o Mike Neuffer, the author of the DPT controller drivers, has some interesting pages on SCSI <http://www.i-connect.net/~mike/scsi> and DPT <http://www.i-connect.net/~mike/scsi/dpt>.

o Raid 1 development information can be found at Raid 1 development page <http://www.nucleu.unam.mx/~miguel/raid>.

o Mark D.
Roth has information on VPS <http://www.uiuc.edu/ph/www/roth>

o A similar kind of project on an Enhanced File System <http://www.i-connect.net/~mike/scsi>

Please let me know if you have any other leads that can be of interest.

Remember you can also use the web search engines and that some, like Altavista <http://www.altavista.digital.com>, Excite <http://www.excite.com> and Hotbot <http://www.hotbot.com>, can also search usenet news.

Also remember that Dejanews <http://www.dejanews.com> is a dedicated news searcher that keeps a news spool from early 1995 and onwards.

If you have to ask for help you are most likely to get it in the comp.os.linux.setup news group. Due to a large workload and a slow network connection I am not able to follow that newsgroup, so if you want to contact me you have to do so by e-mail.

12. Concluding Remarks

Disk tuning and partition decisions are difficult to make, and there are no hard rules here. Nevertheless it is a good idea to work more on this as the payoffs can be considerable. Maximizing usage on one drive only while the others are idle is unlikely to be optimal; watch the drive lights, they are not there just for decoration. For a properly set up system the lights should look like Christmas in a disco. Linux offers software RAID but also support for some hardware based SCSI RAID controllers. Check what is available. As your system and experiences evolve you are likely to repartition and you might look on this document again. Additions are always welcome.

12.1. Coming Soon

There are a few more important things that are about to appear here. In particular I will add more example tables as I am about to set up two fairly large and general systems, one at work and one at home. These should give some general feeling for how a system can be set up for either of these two purposes. Examples of smooth running existing systems are also welcome.

There is also a fair bit of work left to do on the various kinds of file systems and utilities.
There will be a big addition on drive technologies coming soon, as well as a more in depth description of using fdisk or cfdisk. The section on file systems will be beefed up as more features become available, and there will be more on RAID and what directories can benefit from what RAID level. Also I hope to get some information from DPT who make the only RAID controller supported by Linux so far. I have contacted them but have yet to hear from them.

There is some minor overlap with the Linux Filesystem Structure Standard that I hope to integrate better soon, which will probably mean a big reworking of all the tables at the end of this document. When the new version is released there will be a substantial rewrite of some of the sections in this HOWTO but no release date has been announced yet. When the new standard appears various details such as directory names, sizes and file placements will be changed.

I have made the assumption that the first partition starts at track 0 and that this track is the innermost track. That, however, is looking more and more like an unwarranted assumption, and not only because of the logical re-mapping that takes place. More on this when information becomes available.

As more people start reading this I should get some more comments and feedback. I am also thinking of making a program that can automate a fair bit of this decision making process and although it is unlikely to be optimum it should provide a simpler, more complete starting point.

12.2. Request for Information

It has taken a fair bit of time to write this document and although most pieces are beginning to come together there is still some information needed before we are out of the beta stage.

o More information on swap sizing policies is needed as well as information on the largest swap size possible under the various kernel versions.

o How common is drive or file system corruption? So far I have only heard of problems caused by flaky hardware.

o References to speed and drives are needed.

o Are any other Linux compatible RAID controllers available?

o Leads to file system, volume management and other related software are welcome.

o What relevant monitoring, management and maintenance tools are available?

o General references to information sources are needed, perhaps this should be a separate document?

o Usage of /tmp and /var/tmp has been hard to determine, in fact what programs use which directory is not well defined and more information here is required. Still, it seems at least clear that these should reside on different physical drives in order to increase parallelism.

12.3. Suggested Project Work

Now and then people post on comp.os.linux.*, looking for good project ideas. Here I will list a few that come to mind that are relevant to this document. Plans about big projects such as new file systems should still be posted in order to either find co-workers or see if someone is already working on it.

Planning tools that can automate the design process outlined earlier would probably make a medium sized project, perhaps as an exercise in constraint based programming.

Partitioning tools that take the output of the previously mentioned program and format drives in parallel and apply the appropriate symbolic links to the directory structure. It would probably be best if this were integrated in existing system installation software. The drive partitioning setup used in Solaris is an example of what it can look like.

Surveillance tools that keep an eye on the partition sizes and warn before a partition overflows.

Migration tools that safely let you move old structures to new (for instance RAID) systems. This could probably be done as a shell script controlling a backup program and would be rather simple. Still, be sure it is safe and that the changes can be undone.

13. Questions and Answers

This is just a collection of what I believe are the most common questions people might have.
Give me more feedback and I will turn this section into a proper FAQ.

o Q: I have a single drive, will this HOWTO help me?
  A: Yes, although only to a minor degree. Still, the section on ``Physical Track Positioning'' will give you some gains.

o Q: Are there any disadvantages in this scheme?
  A: There is only a minor snag: if even a single partition overflows the system might stop working properly. The severity depends of course on what partition is affected. Still this is not hard to monitor, the command df gives you a good overview of the situation. Also check the swap partition(s) using free to make sure you are not about to run out of virtual memory.

o Q: OK, so should I split the system into as many partitions as possible for a single drive?
  A: No, there are several disadvantages to that. First of all maintenance becomes needlessly complex and you gain very little in this. In fact if your partitions are too big you will seek across larger areas than needed. This is a balance and dependent on the number of physical drives you have.

o Q: Does that mean more drives allow more partitions?
  A: To some degree, yes. Still, some directories should not be split off from root, check out the file system standard (soon released under the name File Hierarchy Standard) for more details.

o Q: What if I have many drives I want to use?
  A: If you have more than 3-4 drives you should consider using RAID of some form. Still, it is a good idea to keep your root partition on a simple partition without RAID, see the section on ``RAID'' for more details.

o Q: I have installed the latest Windows95 but cannot access this partition from within the Linux system, what is wrong?
  A: Most likely you are using FAT32 in your Windows partition. It seems that Microsoft decided we needed yet another format, and this was introduced in their latest version of Windows95. The advantage is that this format is better suited to large drives. Unfortunately there is no stable driver for Linux out yet.
  A test version is out but not yet in the standard kernel. You might also be interested to hear that Microsoft NT 4.0 does not support it yet either. Until a stable version is available you can avoid this problem by installing Windows95 over an existing FAT16 partition, made for instance by an older installation of DOS. This forces Windows95 to use FAT16, which is supported by Linux.

o Q: I cannot get the disk size and partition sizes to match, something is missing. What has happened?
  A: It is possible you have mounted a partition onto a mount point that was not an empty directory. Mount points are directories and if one is not empty the mounting will mask its contents. If you do the sums you will see the amount of disk space used in this directory is missing from the observed total. To solve this you can boot from a rescue disk and see what is hiding behind your mount points, and remove or transfer the contents by mounting the offending partition on a temporary mounting point. You might find it useful to have "spare" emergency mounting points ready made.

o Q: What is this nyx that is mentioned several times here?
  A: It is a large free Unix system with currently about 5000 users. I have used it for my web pages for this HOWTO as well as a source of ideas for a setup of large Unix systems. It has been running for many years and has a quite stable setup. For more information you can view the Nyx homepage <http://www.nyx.net> which also gives you information on how to get your own free account.

14. Bits and Pieces

This is basically a section where I stuff all the bits I have not yet decided where to put, yet feel are worth knowing about. It is a kind of transient area.

14.1. Combining swap and /tmp

Recently there have been discussions in the various Linux related news groups about specialized file systems for temporary storage. This is partly inspired by the tmpfs on *BSD* and Solaris, as well as swapfs on the NeXT machines.

The rationale is that these are temporary storage areas that normally do not require much space, yet in normal systems you need to reserve a certain amount of space for them. Elementary statistical knowledge tells you (very simplified) that when you sum a number of variables the relative statistical uncertainty decreases. So by combining swap and /tmp you do not need to reserve as much space as you otherwise would.

These specialized file systems are nothing more than swappable RAM disks that are swapped out to disk when and only when space is limited, thus effectively putting temporary files on the swap partition.

There is, however, a snag. This scheme prevents you from getting parallel activity on swap and /tmp drives so under heavy activity the system takes a bigger performance hit. Put another way, you trade speed to get space. Interleaving across multiple drives reduces this somewhat.

14.2. Interleaved swap drives.

This is not striping across several drives, instead drives are accessed in a round robin fashion in order to spread the load in a crude fashion. In Linux you additionally have a priority parameter you can adjust for tuning your system, especially useful if your disks differ significantly in speed. Check man 8 swapon as well as man 2 swapon for more information.

14.3. Swap partition: to use or not to use

In many cases you do not need a swap partition, for instance if you have plenty of RAM, say, more than 64MB, and you are the sole user of the machine. In this case you can experiment running without a swap partition and check the system logs to see if you ran out of virtual memory at any point.

Removing swap partitions has two advantages:

o you save disk space (rather obvious really)

o you save seek time as swap partitions otherwise would lie in the middle of your disk space.

In the end, having a swap partition is like having a heated toilet: you do not use it very often, but you sure appreciate it those few times you require it.

14.4. Mount point and mnt

In an earlier version of this document I proposed to put all permanently mounted partitions under /mnt. That, however, is not such a good idea as this itself can be used as a mount point, which leads to all mounted partitions becoming unavailable. Instead I will propose mounting straight from root using a meaningful name like /mnt.descriptive-name.

14.5. SCSI id numbers and names

Partitions are labeled in the order they are found, not depending on the SCSI id number. This means that if you add a drive with an id number inserted in the previous order of numbers, or change id number in any other way, the partition names will be messed up. This is important if you use removable media.

In order to save yourself from some unpleasant experiences, you are recommended to use low numbers for fixed media and reserve the last number(s) for removable media drives.

Many have been bitten by this misfeature and there is a strong call for something to be done about it. Nobody knows how soon this will be fixed so in the meantime it is wise to take this into consideration when you design your system.

14.6. Dejanews

This is an Internet system that no doubt most of you are familiar with. It searches and serves Usenet News articles from 1995 up to the latest postings and also offers a web based reading and posting interface. There is a lot more, check out Dejanews <http://www.dejanews.com> for more information.

What perhaps is less known is that they use a pair of Linux SMP computers with 256MB RAM and a disk farm of a few hundred GB for this service.

Just in case: this is not an advertisement, it is stated as an example of how much is required for what is a major Internet service.

14.7. File system structure

There are many file system structures in existence, differing from FSSTND (and soon FHS) to varying degrees in terms of philosophy, strategy and implementation.
It is not possible to detail them all here; instead the interested reader should read the relevant manual page, man hier, which is available on many platforms and implementations.

15. Appendix A: Partitioning layout table: mounting and linking

The following table is designed to make layout a simpler paper and pencil exercise. It is probably best to print it out (using NON PROPORTIONAL fonts) and adjust the numbers until you are happy with them.

Mount point is what directory you wish to mount a partition on or the actual device. This is also a good place to note how you plan to use symbolic links.

The size given corresponds to a fairly big Debian 1.2.6 installation. Other examples are coming later.

Mainly you use this table to select what structure and drives you will use; the partition numbers and letters will come from the next two tables.

Directory         Mount point      speed   seek    transfer  size  SIZE

swap              __________       ooooo   ooooo   ooooo     32    ____

/                 __________       o       o       o         20    ____

/tmp              __________       oooo    oooo    oooo            ____

/var              __________       oo      oo      oo        25    ____
/var/tmp          __________       oooo    oooo    oooo            ____
/var/spool        __________                                       ____
/var/spool/mail   __________       o       o       o               ____
/var/spool/news   __________       ooo     ooo     oo              ____
/var/spool/____   __________       ____    ____    ____            ____

/home             __________       oo      oo      oo              ____

/usr              __________                                 500   ____
/usr/bin          __________       o       oo      o         250   ____
/usr/lib          __________       oo      oo      ooo       200   ____
/usr/local        __________                                       ____
/usr/local/bin    __________       o       oo      o               ____
/usr/local/lib    __________       oo      oo      ooo             ____
/usr/local/____   __________                                       ____
/usr/src          __________       o       oo      o         50    ____

DOS               __________       o       o       o               ____
Win               __________       oo      oo      oo              ____
NT                __________       ooo     ooo     ooo             ____

/mnt.___/_____    __________       ____    ____    ____            ____
/mnt.___/_____    __________       ____    ____    ____            ____
/mnt.___/_____    __________       ____    ____    ____            ____
/___/___/_____    __________       ____    ____    ____            ____
/___/___/_____    __________       ____    ____    ____            ____
/___/___/_____    __________       ____    ____    ____            ____

Total capacity:

16. Appendix B: Partitioning layout table: numbering and sizing

This table follows the same logical structure as the table above where you decided what disk to use. Here you select the physical tracking, keeping in mind the effect of track positioning mentioned earlier in ``Physical Track Positioning''.

The final partition number will come out of the table after this.

Directory         sda     sdb     sdc     hda     hdb     hdc     ___

swap              |       |       |       |       |       |       |

/                 |       |       |       |       |       |       |

/tmp              |       |       |       |       |       |       |

/var              :       :       :       :       :       :       :
/var/tmp          |       |       |       |       |       |       |
/var/spool        :       :       :       :       :       :       :
/var/spool/mail   |       |       |       |       |       |       |
/var/spool/news   :       :       :       :       :       :       :
/var/spool/____   |       |       |       |       |       |       |

/home             |       |       |       |       |       |       |

/usr              |       |       |       |       |       |       |
/usr/bin          :       :       :       :       :       :       :
/usr/lib          |       |       |       |       |       |       |
/usr/local        :       :       :       :       :       :       :
/usr/local/bin    |       |       |       |       |       |       |
/usr/local/lib    :       :       :       :       :       :       :
/usr/local/____   |       |       |       |       |       |       |
/usr/src          :       :       :       :       :       :       :

DOS               |       |       |       |       |       |       |
Win               :       :       :       :       :       :       :
NT                |       |       |       |       |       |       |

/mnt.___/_____    |       |       |       |       |       |       |
/mnt.___/_____    :       :       :       :       :       :       :
/mnt.___/_____    |       |       |       |       |       |       |
/___/___/_____    |       |       |       |       |       |       |
/___/___/_____    :       :       :       :       :       :       :
/___/___/_____    |       |       |       |       |       |       |

Total capacity:

17. Appendix C: Partitioning layout table: partition placement

This is just to sort the partition numbers in ascending order ready to input to fdisk or cfdisk. Here you take physical track positioning into account when finalizing your design. These numbers and letters are then used to update the previous tables, all of which you will find very useful in later maintenance.

Drive :           sda     sdb     sdc     hda     hdb     hdc     ___

Total capacity:   |       |       |       |       |       |       |

Partition
   1              |       |       |       |       |       |       |
   2              :       :       :       :       :       :       :
   3              |       |       |       |       |       |       |
   4              :       :       :       :       :       :       :
   5              |       |       |       |       |       |       |
   6              :       :       :       :       :       :       :
   7              |       |       |       |       |       |       |
   8              :       :       :       :       :       :       :
   9              |       |       |       |       |       |       |
  10              :       :       :       :       :       :       :
  11              |       |       |       |       |       |       |
  12              :       :       :       :       :       :       :
  13              |       |       |       |       |       |       |
  14              :       :       :       :       :       :       :
  15              |       |       |       |       |       |       |
  16              :       :       :       :       :       :       :

18. Appendix D: Example: Multipurpose server

The following table is from the setup of a medium sized multipurpose server where I work.
Aside from being a general Linux machine it will also be a network related server (DNS, mail, FTP, news, printers etc.), X server for various CAD programs, CD ROM burner and many other things. The files reside on 3 SCSI drives with a capacity of 600, 1000 and 1300 MB.

Some further speed could possibly be gained by splitting /usr/local from the rest of the /usr system but we deemed the further added complexity would not be worth it. With another couple of drives this could be more worthwhile. In this setup drive sda is old and slow and could just as well be replaced by an IDE drive. The other two drives are both rather fast. Basically we split most of the load between these two. To reduce dangers of imbalance in partition sizing we have decided to keep /usr/bin and /usr/local/bin on one drive and /usr/lib and /usr/local/lib on another separate drive, which also affords us some drive parallelizing.

Even more could be gained by using RAID but we felt that as a server we needed more reliability than is currently afforded by the md patch, and a dedicated RAID controller was out of our reach.

19. Appendix E: Example: mounting and linking

Directory         Mount point      speed   seek    transfer  size  SIZE

swap              sdb2, sdc2       ooooo   ooooo   ooooo     32    2x64

/                 sda2             o       o       o         20    100

/tmp              sdb3             oooo    oooo    oooo            300

/var              __________       oo      oo      oo              ____
/var/tmp          sdc3             oooo    oooo    oooo            300
/var/spool        sdb1                                             436
/var/spool/mail   __________       o       o       o               ____
/var/spool/news   __________       ooo     ooo     oo              ____
/var/spool/____   __________       ____    ____    ____            ____

/home             sda3             oo      oo      oo              400

/usr              sdb4                                       230   200
/usr/bin          __________       o       oo      o         30    ____
/usr/lib          -> libdisk       oo      oo      ooo       70    ____
/usr/local        __________                                       ____
/usr/local/bin    __________       o       oo      o               ____
/usr/local/lib    -> libdisk       oo      oo      ooo             ____
/usr/local/____   __________                                       ____
/usr/src          ->/home/usr.src  o       oo      o         10    ____

DOS               sda1             o       o       o               100
Win               __________       oo      oo      oo              ____
NT                __________       ooo     ooo     ooo             ____

/mnt.libdisk      sdc4             oo      oo      ooo             226
/mnt.cd           sdc1             o       o       oo              710
/mnt.___/_____    __________       ____    ____    ____            ____
/___/___/_____    __________       ____    ____    ____            ____
/___/___/_____    __________       ____    ____    ____            ____
/___/___/_____    __________       ____    ____    ____            ____

Total capacity: 2900 MB

20. Appendix F: Example: numbering and sizing

Here we do the adjustment of sizes and positioning.

Directory           sda     sdb     sdc

swap              |       | 64    | 64    |

/                 | 100   |       |       |

/tmp              |       | 300   |       |

/var              :       :       :       :
/var/tmp          |       |       | 300   |
/var/spool        :       : 436   :       :
/var/spool/mail   |       |       |       |
/var/spool/news   :       :       :       :
/var/spool/____   |       |       |       |

/home             | 400   |       |       |

/usr              |       | 200   |       |
/usr/bin          :       :       :       :
/usr/lib          |       |       |       |
/usr/local        :       :       :       :
/usr/local/bin    |       |       |       |
/usr/local/lib    :       :       :       :
/usr/local/____   |       |       |       |
/usr/src          :       :       :       :

DOS               | 100   |       |       |
Win               :       :       :       :
NT                |       |       |       |

/mnt.libdisk      |       |       | 226   |
/mnt.cd           :       :       : 710   :
/mnt.___/_____    |       |       |       |
/___/___/_____    |       |       |       |
/___/___/_____    :       :       :       :
/___/___/_____    |       |       |       |

Total capacity:   | 600   | 1000  | 1300  |

21. Appendix G: Example: partition placement

This is just to sort the partition numbers in ascending order ready to input to fdisk or cfdisk.

Drive :             sda     sdb     sdc

Total capacity:   | 600   | 1000  | 1300  |

Partition
   1              | 100   | 436   | 710   |
   2              : 100   : 64    : 64    :
   3              | 400   | 300   | 300   |
   4              :       : 200   : 226   :

22. Appendix H: Example II

The following is an example of a server setup in an academic setting, and is contributed by nakano@apm.seikei.ac.jp. I have only done minor editing to this section.

/var/spool/delegate is a directory for storing logs and cache files of a WWW proxy server program, "delegated". Since I have not announced it widely, there are currently 1000-1500 requests/day, and average disk usage is 15-30% with expiration of caches each day.

/mnt.archive is used for data files which are big and not frequently referenced such as experimental data (especially graphic ones), various source archives, and Win95 backups (growing very fast...).

/mnt.root is a backup root file system containing rescue utilities. A boot floppy is also prepared to boot with this partition.

=================================================
Directory               sda     sdb     hda

swap                  | 64    | 64    |       |
/                     |       |       | 20    |
/tmp                  |       |       | 180   |

/var                  : 300   :       :       :
/var/tmp              |       | 300   |       |
/var/spool/delegate   | 300   |       |       |

/home                 |       |       | 850   |

/usr                  | 360   |       |       |
/usr/lib        -> /mnt.lib/usr.lib
/usr/local/lib  -> /mnt.lib/usr.local.lib

/mnt.lib              |       | 350   |       |
/mnt.archive          :       : 1300  :       :
/mnt.root             |       | 20    |       |

Total capacity:         1024    2034    1050

=================================================
Drive :                 sda     sdb     hda

Total capacity:       | 1024  | 2034  | 1050  |

Partition
   1                  | 300   | 20    | 20    |
   2                  : 64    : 1300  : 180   :
   3                  | 300   | 64    | 850   |
   4                  : 360   : ext   :       :
   5                  |       | 300   |       |
   6                  :       : 350   :       :

Filesystem         1024-blocks  Used Available Capacity Mounted on
/dev/hda1              19485   10534     7945     57%   /
/dev/hda2             178598      13   169362      0%   /tmp
/dev/hda3             826640  440814   343138     56%   /home
/dev/sda1             306088   33580   256700     12%   /var
/dev/sda3             297925   47730   234807     17%   /var/spool/delegate
/dev/sda4             363272  170872   173640     50%   /usr
/dev/sdb5             297598       2   282228      0%   /var/tmp
/dev/sdb2            1339248  302564   967520     24%   /mnt.archive
/dev/sdb6             323716   78792   228208     26%   /mnt.lib

Apparently /tmp and /var/tmp are too big. These directories will be merged into one partition when disk space runs short.

/mnt.lib also seems to be too big, but I plan to install newer TeX and ghostscript archives, so /usr/local/lib may grow by about 100 MB or so (since we must use Japanese fonts!).

The whole system is backed up by a Seagate Tapestore 8000 (Travan TR-4, 4G/8G).

23. Appendix H: Example III: SPARC Solaris

The following section is the basic design used at work for a number of Sun SPARC servers running Solaris 2.5.1 in an industrial development environment. It serves a number of database and CAD applications in addition to the normal services such as mail. Simplicity is emphasized here so /usr/lib has not been split off from /usr.

This is the basic layout, planned for about 100 users.

Drive:            SCSI 0                        SCSI 1

Partition         Size (MB)     Mount point     Size (MB)     Mount point

   0              160           swap            160           swap
   1              100           /tmp            100           /var/tmp
   2              400           /usr
   3              100           /
   4               50           /var
   5
   6              remainder     /local0         remainder     /local1

Due to specific requirements at this place it is at times necessary to have large partitions available on short notice. Therefore drive 0 is given as many tasks as feasible, leaving a large /local1 partition.

This setup has been in use for some time now and has been found satisfactory.

For a more general system it would be better to swap /tmp and /var/tmp and then move /var to drive 1.
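
The paper and pencil sums in the example tables can also be checked mechanically. The following is only a minimal sketch in Python and is not part of any existing tool: the names LAYOUTS, unallocated and check_layout are made up for illustration. The sizes (in MB) are copied from the partition placement tables of the multipurpose server example (Appendix G) and of Example II (Appendix H). The extended partition on sdb in Example II is left out since it only holds partitions 5 and 6.

```python
# Partition sizes in MB, copied from the example partition placement tables.
# The data layout below is an assumption made for this sketch only.
LAYOUTS = {
    "multipurpose server": {
        "sda": (600, [100, 100, 400]),
        "sdb": (1000, [436, 64, 300, 200]),
        "sdc": (1300, [710, 64, 300, 226]),
    },
    "academic server": {
        "sda": (1024, [300, 64, 300, 360]),
        "sdb": (2034, [20, 1300, 64, 300, 350]),
        "hda": (1050, [20, 180, 850]),
    },
}


def unallocated(capacity, partitions):
    """Return the space (MB) left on a drive once all partitions are placed."""
    return capacity - sum(partitions)


def check_layout(layout):
    """Map each drive name to its leftover space; negative means overcommitted."""
    return {
        drive: unallocated(capacity, partitions)
        for drive, (capacity, partitions) in layout.items()
    }


if __name__ == "__main__":
    for name, layout in LAYOUTS.items():
        for drive, left in check_layout(layout).items():
            status = "OVERCOMMITTED" if left < 0 else "ok"
            print(f"{name:20s} {drive}: {left:5d} MB unallocated ({status})")
```

With the numbers from the two examples every drive comes out with nothing left over. A negative value would mean the partitions planned for a drive do not fit on it, which is worth catching before running fdisk or cfdisk.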